DEV Community

Coded Parts

Processing a 2GB CSV in Node Without Running Out of Memory

Why the obvious approach crashes, and how a few generator functions keep memory flat no matter how big the file gets.

Here's a task that looks trivial on paper: read a CSV export, filter the rows you care about, sum one column, write a small report. The kind of thing you bang out in ten minutes. Now say the file is around 2GB.

The first version is four lines. It works great on a 5MB sample. Then you point it at the real export and Node falls over with JavaScript heap out of memory. The reflex is to do what most of us do first: bump --max-old-space-size, give it more heap, and run it again. It gets further and dies again. That's the moment to stop fighting the symptom and look at what the code is actually asking the machine to do.

Here is the thing worth internalizing: the size of your data does not have to dictate the size of your memory footprint. You can process a file bigger than your RAM. The trick is to never hold the whole thing at once, and generators give you a clean way to write code that does exactly that without turning into a mess of callbacks and manual state.

Let's build up to it properly.

The version that dies

Here's roughly what the first attempt looked like:

const fs = require('fs');

const rows = fs.readFileSync('export.csv', 'utf8').split('\n');

let total = 0;
for (const row of rows) {
  const amount = Number(row.split(',')[2]);
  if (!Number.isNaN(amount)) total += amount;
}

console.log('total:', total);

Read the file. Split on newlines. Loop. Sum. Clean and readable, and on a small file it's perfect.

The problem is hiding in the first line, and it's actually two problems stacked on top of each other.

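fs.readFileSync pulls the entire file into memory before you do anything with it, and with the 'utf8' option it decodes the whole thing into one giant string. A 2GB file is a 2GB allocation, minimum. Then .split('\n') takes that string and produces an array with one string per line. For a file with millions of rows, that's millions of string objects, each with its own overhead, all alive at the same time. So now you're holding the raw file and a fully expanded array of every line. You've roughly doubled the cost of the thing that was already too big. (A file this size may not even get that far: V8 caps the length of a single string, exposed in Node as buffer.constants.MAX_STRING_LENGTH, so the decode step can fail outright no matter how much heap you give it.)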

I wanted to see how bad it actually is, so I ran it. I generated a CSV with 2 million rows (id,name,amount), which came out to about 45MB. Modest. Not even close to 2GB. Here's what the load-everything approach did to memory:

naive sum: 999000000 | peak RSS MB: 238

238MB of resident memory to process a 45MB file. That's more than five times the file size sitting in RAM at peak. Now scale that ratio up. A 2GB file with the same shape would want somewhere north of 10GB, and your container almost certainly does not have that. Hence the crash.
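
For anyone who wants to reproduce a measurement like this, a minimal sketch of a sample-file generator and a peak-memory readout might look like the following. The i % 1000 amount pattern, the user${i} names, and the helper names writeSampleCsv and peakRssMb are assumptions for illustration, not the exact script behind the numbers above. With 2 million rows, that amount pattern does add up to 999000000.

```javascript
const fs = require('fs');
const { once } = require('events');

// Hypothetical generator for a test file shaped like id,name,amount.
// The amount pattern (i % 1000) is an assumption made for this sketch.
async function writeSampleCsv(path, rowCount) {
  const out = fs.createWriteStream(path);
  out.write('id,name,amount\n');
  for (let i = 0; i < rowCount; i++) {
    // write() returns false when the internal buffer is full,
    // so wait for 'drain' before writing more.
    if (!out.write(`${i},user${i},${i % 1000}\n`)) {
      await once(out, 'drain');
    }
  }
  out.end();
  await once(out, 'finish');
}

// Peak resident set size of this process so far, in megabytes.
// process.resourceUsage().maxRSS is reported in kilobytes.
function peakRssMb() {
  return Math.round(process.resourceUsage().maxRSS / 1024);
}

// Usage (commented out so that loading this file stays cheap):
// writeSampleCsv('export.csv', 2_000_000).then(() => {
//   console.log('peak RSS MB:', peakRssMb());
// });
```

Run the generator once, then call peakRssMb() at the very end of whichever version you are measuring. The absolute numbers will vary with Node version and platform.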

What we actually want

Step back from the code for a second.

To sum a column, do you ever genuinely need every row in memory simultaneously? No. You need one row at a time. Read a line, pull out the number, add it to a running total, throw the line away, move on. At no point does row 1,400,000 need to coexist with row 3.

That's the whole insight. The work is sequential and one-pass, so the memory should be too. We want to pull rows through the program one at a time, like water through a pipe, instead of trying to fit an entire ocean in a bucket.

Node has had streams forever, and streams do exactly this. But raw streams are awkward to compose. The moment you want to chain "read lines" into "parse them" into "filter them" into "sum them," you're wiring up event handlers and managing backpressure by hand, and the readable four-line version turns into something you don't want to look at.
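
To make that awkwardness concrete, here is a rough sketch of just the "read lines and sum" part written against raw stream events. The function name sumWithEvents is made up for this example, and it assumes the same id,name,amount layout. Note how the partial line at the end of each chunk has to be carried over by hand, and this is before any filtering or parsing stage is added.

```javascript
const fs = require('fs');

// Sum the third column using raw 'data' / 'end' events (illustrative sketch).
function sumWithEvents(path) {
  return new Promise((resolve, reject) => {
    let total = 0;
    let leftover = ''; // partial line carried over between chunks

    const addLine = (line) => {
      const amount = Number(line.split(',')[2]);
      if (!Number.isNaN(amount)) total += amount; // header and blanks give NaN
    };

    const stream = fs.createReadStream(path, { encoding: 'utf8' });
    stream.on('data', (chunk) => {
      const lines = (leftover + chunk).split('\n');
      leftover = lines.pop(); // the last piece may be an incomplete line
      lines.forEach(addLine);
    });
    stream.on('end', () => {
      if (leftover) addLine(leftover);
      resolve(total);
    });
    stream.on('error', reject);
  });
}

// Usage:
// sumWithEvents('export.csv').then((total) => console.log('total:', total));
```

It works and memory stays flat, but the bookkeeping is already leaking into the logic.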

This is where generators earn their place.

Generators, the one-paragraph version

A normal function runs start to finish and returns once. A generator function (the function* syntax) can pause itself partway through, hand a value back to whoever called it, and then resume from exactly where it left off the next time you ask for a value. It does this with yield.

For reading files we want the async flavor, async function*, because reading from disk is asynchronous. The consuming side uses for await...of instead of a plain for...of. Same idea, just async.
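
A tiny sketch of both flavors, with made-up function names (countTo, countToSlowly) and a timer standing in for real I/O:

```javascript
// A plain generator: pauses at each yield until the consumer asks again.
function* countTo(n) {
  for (let i = 1; i <= n; i++) {
    yield i;
  }
}

// The async flavor: same shape, but each step may await something first.
async function* countToSlowly(n) {
  for (let i = 1; i <= n; i++) {
    await new Promise((resolve) => setTimeout(resolve, 10)); // stand-in for I/O
    yield i;
  }
}

const counter = countTo(3);
console.log(counter.next()); // { value: 1, done: false }
console.log(counter.next()); // { value: 2, done: false }
console.log([...countTo(3)]); // [ 1, 2, 3 ]

(async () => {
  for await (const n of countToSlowly(3)) {
    console.log('async value:', n); // 1, then 2, then 3
  }
})();
```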

Building the pipeline

Let's write the big-file version as a set of small generators, each doing one job.

First, a generator that yields the file one line at a time. Node's readline module already reads a stream line by line, so we wrap it:

const fs = require('fs');
const readline = require('readline');

async function* readLines(path) {
  const rl = readline.createInterface({
    input: fs.createReadStream(path),
    crlfDelay: Infinity,
  });
  for await (const line of rl) {
    yield line;
  }
}

createReadStream reads the file in small chunks (64KiB by default) rather than all at once. readline assembles those chunks into complete lines, and crlfDelay: Infinity tells it to always treat \r\n as a single line break. We yield each line as it arrives. Crucially, nothing is accumulating here. A line comes in, goes out, and is gone.

Next, a generator that turns raw lines into parsed objects:

async function* parse(lines) {
  for await (const line of lines) {
    const [id, name, amount] = line.split(',');
    if (id === 'id') continue; // skip the header row
    yield { id, name, amount: Number(amount) };
  }
}

Notice it takes a source of lines as its input and yields objects. It doesn't know or care whether those lines came from a file, a network socket, or an array in a test. It just transforms what flows through it.
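
Because for await...of also accepts ordinary sync iterables, the "array in a test" case really is that simple. A small sketch, with parse copied from above so the snippet runs on its own and with invented sample rows:

```javascript
// Copy of the parse stage from above, repeated so this snippet is standalone.
async function* parse(lines) {
  for await (const line of lines) {
    const [id, name, amount] = line.split(',');
    if (id === 'id') continue; // skip the header row
    yield { id, name, amount: Number(amount) };
  }
}

(async () => {
  // A plain array stands in for the file; the sample rows are made up.
  const fakeLines = ['id,name,amount', '1,alice,10', '2,bob,25'];
  const rows = [];
  for await (const row of parse(fakeLines)) rows.push(row);
  console.log(rows);
  // [ { id: '1', name: 'alice', amount: 10 },
  //   { id: '2', name: 'bob', amount: 25 } ]
})();
```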

Now a filter, because in this scenario I only wanted rows at or above a threshold:

async function* onlyAbove(rows, min) {
  for await (const row of rows) {
    if (row.amount >= min) {
      yield row;
    }
  }
}

And finally we connect them and consume the result:

(async () => {
  const lines = readLines('export.csv');
  const parsed = parse(lines);
  const filtered = onlyAbove(parsed, 0);

  let total = 0;
  let count = 0;
  for await (const row of filtered) {
    total += row.amount;
    count++;
  }

  console.log('total:', total, 'count:', count);
})();

Read it top to bottom. readLines produces lines, parse consumes those and produces objects, onlyAbove consumes those and produces a filtered subset, and the for await loop at the bottom pulls the whole chain. Each stage is only a handful of lines. Each one does a single thing. You can test them in isolation, reorder them, drop one in or out, all without touching the others.

Here's the part that matters. I ran this exact pipeline against the same 2 million row file:

pipeline sum: 999000000 count: 2000000 | peak RSS MB: 89

Same answer, 999000000, down to the last digit. But peak memory went from 238MB to 89MB. And that 89MB is not really "memory for the data." It's Node's baseline plus the read buffer plus a couple of objects in flight. The data itself is barely there because we only ever hold one row at a time. Throw a 2GB file at this and the number should stay roughly flat, because nothing in the pipeline grows with the size of the input. That's the whole game.

Why this composes when streams alone don't

You might be thinking, fine, but Node streams could do this too, and you'd be right. So why the generators?

Pull versus push. A raw readable stream pushes data at you through events; you react to 'data' and 'end' and you manage the timing yourself. When you chain several transformations, you're coordinating several event emitters and making sure none of them races ahead of a slow consumer. Backpressure, in the jargon.

Generators flip it to pull. The consumer at the bottom of the loop asks for the next value, and that request travels back up the chain. onlyAbove asks parse for a row, parse asks readLines for a line, readLines asks the file for a chunk. Nothing is produced until something downstream wants it. Backpressure isn't something you configure; it's just how yield works. The producer literally cannot get ahead because it's paused until you call for the next value. (The file stream underneath does read ahead a little, but only up to a bounded internal buffer.)
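
One way to see the pull behavior is to put an endless producer behind a consumer that only wants three values. This is a sketch with hypothetical helpers (countingNaturals and take are not part of the pipeline above); the counter records how many values the producer actually generated.

```javascript
// An endless source that records how many values it has actually produced.
// (Hypothetical helper, purely for illustration.)
function countingNaturals() {
  const stats = { produced: 0 };
  async function* generate() {
    for (let n = 0; ; n++) {
      stats.produced++;
      yield n;
    }
  }
  return { stats, source: generate() };
}

// Pass through at most `limit` items, then stop asking for more.
async function* take(source, limit) {
  if (limit <= 0) return;
  let taken = 0;
  for await (const item of source) {
    yield item;
    if (++taken >= limit) return; // leaving the loop also closes the source
  }
}

(async () => {
  const { stats, source } = countingNaturals();
  const out = [];
  for await (const n of take(source, 3)) out.push(n);
  console.log(out, 'produced:', stats.produced); // [ 0, 1, 2 ] produced: 3
})();
```

The infinite loop never runs away: the producer only advances when the consumer asks, so it stops at exactly three values.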

That's why the four small functions above read almost like the naive version, but behave like a carefully tuned stream. You get the readability of the simple loop and the memory profile of hand-written streaming, without choosing between them.

Where this bites you

I'd be lying if I said this is free.

The big one: you get one pass. A generator is exhausted once you've iterated it. If you need to loop over the data twice, say, sum a column and then also find the max in a separate pass, you can't just iterate the same pipeline again. It's empty the second time. You either compute both in a single pass, or you re-create the pipeline from the source, or, if the result genuinely fits in memory, you collect it into an array (const arr = []; for await (const x of pipe) arr.push(x);) and accept the cost. The streaming approach is for when the dataset doesn't fit, so collecting it usually defeats the point.
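
A small sketch of both the problem and the single-pass fix, using a made-up three-value source (numbers) in place of the real pipeline and a hypothetical sumAndMax helper:

```javascript
// Stand-in for a real pipeline: an async generator with three values.
async function* numbers() {
  yield 5;
  yield 12;
  yield 7;
}

// Compute both aggregates in one pass instead of iterating twice.
async function sumAndMax(source) {
  let total = 0;
  let max = -Infinity;
  for await (const n of source) {
    total += n;
    if (n > max) max = n;
  }
  return { total, max };
}

(async () => {
  const pipe = numbers();
  console.log(await sumAndMax(pipe)); // { total: 24, max: 12 }
  // Iterating the same generator object again yields nothing: it's exhausted.
  console.log(await sumAndMax(pipe)); // { total: 0, max: -Infinity }
  // Re-creating the pipeline from the source works as expected.
  console.log(await sumAndMax(numbers())); // { total: 24, max: 12 }
})();
```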

The other one is debugging. With an array you can console.log the whole thing and see your data. With a lazy pipeline there's nothing to log until you pull a value through, and a console.log inside a generator only fires when that value is actually demanded. The execution order can surprise you the first few times. It clicks, but there's an adjustment period.
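
One pattern that can help is a pass-through stage that logs whatever flows through it. The tap and double helpers below are hypothetical, not something from the pipeline above. The sketch also shows the ordering that tends to surprise people: nothing logs when the pipeline is built, and each value travels the whole chain before the next one starts.

```javascript
// Hypothetical debugging stage: logs each item as it is pulled through,
// then passes it along unchanged.
async function* tap(source, label, log = console.log) {
  for await (const item of source) {
    log(label, item);
    yield item;
  }
}

// A trivial transform so there is something to observe between the taps.
async function* double(source) {
  for await (const n of source) yield n * 2;
}

(async () => {
  const pipe = tap(double(tap([1, 2], 'in')), 'out');
  console.log('pipeline built, nothing logged yet');
  for await (const n of pipe) {
    console.log('consumed', n);
  }
  // Order of output after the first line:
  // in 1, out 2, consumed 2, in 2, out 4, consumed 4
})();
```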

And async generators do carry some per-iteration overhead compared to a tight synchronous loop over an array. If your data comfortably fits in memory and you care about raw speed, the array might genuinely be faster. This technique is about not dying on data that doesn't fit, not about winning microbenchmarks on data that does.

The bit underneath

What I find quietly interesting is that the for await...of loop driving this whole thing is doing something generators were partly built to enable. The pause-and-resume machinery that lets a generator give up control and pick back up later is the same machinery that async/await is built on top of. When you await a promise, your function is effectively yielding control and waiting to be resumed, exactly like a generator yielding a value. async/await is, more or less, a generator and a runner that feeds it resolved promises. Once you've written a few generators by hand, a lot of the async behavior you've been taking on faith stops being magic.
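
To make the "generator and a runner" idea concrete, here is a minimal sketch of such a runner. It illustrates the concept only; it is not how any engine actually implements async/await, and the names run, delay, and addLater are made up for this example.

```javascript
// A minimal runner: drives a generator that yields promises, feeding each
// resolved value back in through next() and each rejection through throw().
function run(generatorFn, ...args) {
  const gen = generatorFn(...args);
  return new Promise((resolve, reject) => {
    function step(method, input) {
      let result;
      try {
        result = gen[method](input);
      } catch (err) {
        reject(err); // the generator threw and did not catch it
        return;
      }
      if (result.done) {
        resolve(result.value);
        return;
      }
      Promise.resolve(result.value).then(
        (value) => step('next', value),
        (err) => step('throw', err),
      );
    }
    step('next', undefined);
  });
}

// Resolves with `value` after `ms` milliseconds.
const delay = (ms, value) => new Promise((resolve) => setTimeout(resolve, ms, value));

// Reads like an async function, with yield playing the role of await.
function* addLater(a, b) {
  const x = yield delay(10, a);
  const y = yield delay(10, b);
  return x + y;
}

run(addLater, 2, 3).then((sum) => console.log(sum)); // 5
```

Swap function* for async function and yield for await, drop the runner, and you have the code you would write today.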

I dug into that whole layer, the two-way communication, yield* composition, the async runner that became async/await, in a short book on generators. It's free. If the pipeline pattern here was useful and you want the full mental model under it, grab it: Generators in JavaScript.

The next time Node tells you it's out of memory, before you reach for a bigger heap, ask whether you ever needed all that data at once in the first place. Usually you didn't.

Cheers :)
