DEV Community

Bruno Hanss

Converting Large JSON, NDJSON, CSV and XML Files without Blowing Up Memory

Most of us have written something like this at some point:

```js
const data = JSON.parse(hugeString);
```

It works.

Until it doesn't.

At some point the file grows.
50MB. 200MB. 1GB. 5GB.

And suddenly:

  • The tab freezes (in the browser)
  • Memory spikes
  • The process crashes
  • Or worse: everything technically "works" but becomes unusable

This isn't a JavaScript problem.

It's a buffering problem.


The Real Issue: Buffering vs Streaming

Most parsing libraries operate in buffer mode:

  • Read the entire file into memory
  • Parse it completely
  • Return the result

That means memory usage scales with file size.

Streaming flips the model:

  • Read chunks
  • Process incrementally
  • Emit records progressively
  • Keep memory nearly constant

That architectural difference matters far more than micro-optimizations.


Why I Built a Streaming Converter

I've been working on a project called convert-buddy-js, a Rust-based
streaming conversion engine compiled to WebAssembly and exposed as a
JavaScript library.

It supports:

  • XML
  • CSV
  • JSON
  • NDJSON

The core goal was simple:

Keep memory usage flat, even as file size grows.

Not "be the fastest library ever."
Just predictable. Stable. Bounded.


What Does "Low Memory" Actually Mean?

Here's an example from benchmarks converting XML → JSON.

| Scenario  | Tool            | File Size | Memory Usage |
|-----------|-----------------|-----------|--------------|
| xml-large | convert-buddy   | 38.41 MB  | ~0 MB change |
| xml-large | fast-xml-parser | 38.41 MB  | 377 MB       |

The difference is architectural.

The streaming engine processes elements incrementally instead of
constructing large intermediate structures.
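To illustrate what "incrementally" means (this toy is mine, not the library's parser): you can count elements across chunk boundaries without ever building a tree, as long as you carry a small tail between chunks in case a tag is split:

```javascript
// Toy illustration: count occurrences of an opening tag across chunks
// without building a DOM tree. Not the library's parser.
function makeElementCounter(tag) {
  const open = "<" + tag + ">";
  let carry = ""; // tail of the previous chunk, in case a tag was split
  let count = 0;
  return {
    push(chunk) {
      const text = carry + chunk;
      let i = 0;
      while ((i = text.indexOf(open, i)) !== -1) {
        count += 1;
        i += open.length;
      }
      // Keep a tail strictly shorter than the tag, so a split tag can still
      // complete on the next push without ever being counted twice.
      carry = text.slice(Math.max(0, text.length - (open.length - 1)));
    },
    total() {
      return count;
    },
  };
}

// Feed chunks that deliberately split a tag in half:
const counter = makeElementCounter("record");
["<reco", "rd>a</record><rec", "ord>b</record>"].forEach((c) => counter.push(c));
console.log(counter.total()); // 2
```

Per-chunk state here is a few bytes, no matter how large the document is. That is the essence of the constant-memory behavior in the table above.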


CSV → JSON Benchmarks

I benchmarked against:

  • PapaParse
  • csv-parse
  • fast-csv

Here's a representative neutral case (1.26 MB CSV):

| Tool          | Throughput |
|---------------|------------|
| convert-buddy | 75.96 MB/s |
| csv-parse     | 22.13 MB/s |
| PapaParse     | 19.57 MB/s |
| fast-csv      | 15.65 MB/s |

In favorable large cases (13.52 MB CSV):

| Tool          | Throughput |
|---------------|------------|
| convert-buddy | 91.88 MB/s |
| csv-parse     | 30.68 MB/s |
| PapaParse     | 24.69 MB/s |
| fast-csv      | 19.68 MB/s |

In most CSV scenarios tested, the streaming approach resulted in roughly
3x to 4x throughput improvements, with dramatically lower memory overhead.
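For context, a MB/s figure like the ones above is simply bytes processed divided by wall-clock time; a minimal sketch of that arithmetic (the helper name and timing numbers are mine):

```javascript
// How a throughput figure like "91.88 MB/s" is typically derived:
// bytes processed divided by elapsed wall-clock seconds.
function throughputMbPerSec(bytes, elapsedMs) {
  return bytes / (1024 * 1024) / (elapsedMs / 1000);
}

// A 13.52 MB file converted in 147 ms reports roughly 92 MB/s.
const mbps = throughputMbPerSec(13.52 * 1024 * 1024, 147);
console.log(mbps.toFixed(2)); // 91.97
```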


Where Streaming Isn't Always Faster

For tiny NDJSON files, native JSON parsing can be faster.

| Scenario    | Tool          | Throughput |
|-------------|---------------|------------|
| NDJSON tiny | Native JSON   | 27.10 MB/s |
| NDJSON tiny | convert-buddy | 10.81 MB/s |

That's expected.

When files are extremely small, the overhead of streaming infrastructure
can outweigh the benefits.
Native JSON.parse is heavily optimized in engines and extremely
efficient for small payloads.

The goal here isn't to replace native JSON for everything.

It's to handle realistic and large workloads predictably.
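For those tiny payloads, the simplest possible approach is entirely reasonable; a sketch (the helper name is mine):

```javascript
// For tiny NDJSON payloads, plain split + JSON.parse is hard to beat:
// the whole string already fits comfortably in memory.
function parseTinyNdjson(text) {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line));
}

const records = parseTinyNdjson('{"a":1}\n{"a":2}\n');
console.log(records.length); // 2
```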


NDJSON → JSON Performance

For medium nested NDJSON datasets:

| Tool          | Throughput  |
|---------------|-------------|
| convert-buddy | 221.79 MB/s |
| Native JSON   | 136.84 MB/s |

That's where streaming and incremental transformation shine, especially
when the workload involves structured transformation rather than just
parsing.


What the Library Looks Like

Install:

```sh
npm install convert-buddy-js
```

Then:

```js
import { ConvertBuddy } from "convert-buddy-js";

const csv = 'name,age,city\nAlice,30,NYC\nBob,25,LA\nCarol,35,SF';

// Configure only what you need. Here we output NDJSON.
const buddy = new ConvertBuddy({ outputFormat: 'ndjson' });

// Stream conversion: records are emitted in batches.
const controller = buddy.stream(csv, {
  recordBatchSize: 2,

  // onRecords can be async: await inside it if you need to (I/O, UI updates, writes...)
  onRecords: async (ctrl, records, stats, total) => {
    console.log('Batch received:', records);

    // Simulate slow async work (writing, rendering, uploading, etc.)
    await new Promise(r => setTimeout(r, 50));

    // Report progress (ctrl.* is the most reliable live state)
    console.log(
      `Progress: ${ctrl.recordCount} records, ${stats.throughputMbPerSec.toFixed(2)} MB/s`
    );
  },

  onDone: (final) => console.log('Done:', final),

  // Enable profiling stats (throughput, latency, memory estimates, etc.)
  profile: true
});

// Optional: await final stats / completion
const final = await controller.done;
console.log('Final stats:', final);
```

It works in:

  • Node
  • Browser
  • Web Workers

Because the core engine is written in Rust and compiled to WebAssembly.


Why Rust + WebAssembly?

Not because it's trendy.

Because:

  • Predictable memory behavior
  • Strong streaming primitives
  • Deterministic performance
  • Easier control over allocations

WebAssembly allows that engine to run safely in the browser without
server uploads.


When This Tool Makes Sense

You probably don't need it if:

  • Files are always < 1MB
  • You're already happy with JSON.parse
  • You don't care about memory spikes

It makes sense if:

  • You process large CSV exports
  • You handle XML feeds
  • You work with NDJSON streams
  • You need conversion in the browser without uploads
  • You want predictable memory footprint

What I Learned Building It

  • Streaming is not just about speed: it's about stability.
  • Benchmarks should include the cases where you lose.
  • Native JSON.parse is hard to beat for tiny payloads.
  • Memory predictability matters more than peak throughput.

Closing Thoughts

There are many good parsing libraries in the JavaScript ecosystem.

PapaParse is mature.
csv-parse is robust.
Native JSON.parse is extremely optimized.

convert-buddy-js is simply an option focused on:

  • Streaming
  • Low memory usage
  • Format transformation
  • Large file handling

If that matches your constraints, it may be useful.

If not, the ecosystem already has excellent tools.

If you're curious, the full benchmarks and scenarios are available in
the repository:

  • convert-buddy-js on npm
  • brunohanss/convert-buddy on GitHub

And if you have workloads where streaming would make a difference, I’d be interested in feedback.
You can get more information or try the interactive browser playground here: https://convert-buddy.app/
