
Michael Lip

Posted on • Originally published at zovo.one

Why Your CSV-to-JSON Pipeline Needs More Than a One-Line Script

I maintain a data pipeline that ingests CSV files from three different vendors and converts them to JSON for processing. Each vendor has different quoting conventions, different encodings, and different ideas about what constitutes a valid CSV. A one-line conversion script lasted exactly one day before the edge cases started rolling in.

If you are converting CSV to JSON in production, you need to handle more than just comma splitting.

The vendor problem

Vendor A sends UTF-8 CSV files with proper RFC 4180 quoting. Vendor B sends Windows-1252 encoded files with semicolons as delimiters (common in European locales). Vendor C sends tab-delimited files with a .csv extension and no quoting whatsoever.

All three arrive as "CSV files." Your converter needs to handle all three or produce garbage output for two of them.

Detection strategies:

Encoding detection: Read the first few bytes. A BOM (byte order mark) identifies UTF-8 (EF BB BF), UTF-16 LE (FF FE), or UTF-16 BE (FE FF). Without a BOM, try parsing as UTF-8 first; if that fails or produces replacement characters, fall back to Windows-1252.
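The BOM check plus UTF-8 fallback described above can be sketched in a few lines. This is a minimal illustration, not the pipeline's actual code; the function name `detectEncoding` is mine, and it assumes UTF-8 and Windows-1252 are the only non-BOM encodings in play:

```javascript
// Detect encoding from the first bytes of a file buffer.
function detectEncoding(buf) {
  // BOM signatures: EF BB BF (UTF-8), FF FE (UTF-16 LE), FE FF (UTF-16 BE).
  if (buf.length >= 3 && buf[0] === 0xEF && buf[1] === 0xBB && buf[2] === 0xBF) return 'utf-8';
  if (buf.length >= 2 && buf[0] === 0xFF && buf[1] === 0xFE) return 'utf-16le';
  if (buf.length >= 2 && buf[0] === 0xFE && buf[1] === 0xFF) return 'utf-16be';
  // No BOM: decode as UTF-8; invalid byte sequences become U+FFFD
  // replacement characters, which signal a non-UTF-8 file.
  const text = buf.toString('utf-8');
  return text.includes('\uFFFD') ? 'windows-1252' : 'utf-8';
}
```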

Delimiter detection: Count occurrences of common delimiters (comma, semicolon, tab, pipe) in the first few lines. The character that appears most consistently across lines is likely the delimiter.
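A rough sketch of that counting heuristic, assuming you already have the first few lines as strings (the scoring scheme here is my own illustration):

```javascript
// Pick the delimiter whose per-line count is non-zero and most
// consistent across the sample lines.
function detectDelimiter(sampleLines) {
  const candidates = [',', ';', '\t', '|'];
  let best = ',';
  let bestScore = -1;
  for (const d of candidates) {
    const counts = sampleLines.map((l) => l.split(d).length - 1);
    const min = Math.min(...counts);
    const max = Math.max(...counts);
    // A delimiter that appears the same non-zero number of times on
    // every line is a much stronger signal than a sporadic one.
    const score = min > 0 && min === max ? min + 100 : min;
    if (score > bestScore) {
      bestScore = score;
      best = d;
    }
  }
  return best;
}
```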

Quote detection: Check if the file uses double quotes, single quotes, or no quoting. RFC 4180 specifies double quotes, but real-world files use anything.

Handling large files

Small CSV files can be loaded entirely into memory, converted, and written out. Large files (hundreds of megabytes to gigabytes) require streaming.

In Node.js, a streaming approach:

const readline = require('readline');
const fs = require('fs');

const rl = readline.createInterface({
  input: fs.createReadStream('large.csv'),
  crlfDelay: Infinity
});

const output = fs.createWriteStream('output.json');
output.write('[\n');

let headers = null;
let first = true;

rl.on('line', (line) => {
  const fields = parseLine(line); // proper CSV parsing
  if (!headers) {
    headers = fields;
    return;
  }
  const obj = {};
  headers.forEach((h, i) => obj[h] = fields[i] || '');
  if (!first) output.write(',\n');
  output.write(JSON.stringify(obj));
  first = false;
});

rl.on('close', () => {
  output.write('\n]');
  output.end();
});

This processes line by line without loading the entire file into memory. However, it does not handle multiline quoted fields, which is a limitation of the readline approach.
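For completeness, here is one way the `parseLine` placeholder could be filled in. It handles RFC 4180 double quotes and `""` escapes within a single line, but inherits the same limitation: a quoted field containing a newline will have been split by readline before it ever reaches this function.

```javascript
// Parse one CSV line into fields, honoring double quotes and ""
// escapes (RFC 4180). Assumes the whole record is on one line.
function parseLine(line, delimiter = ',') {
  const fields = [];
  let field = '';
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line[i];
    if (inQuotes) {
      if (ch === '"' && line[i + 1] === '"') { field += '"'; i++; } // escaped quote
      else if (ch === '"') inQuotes = false;                         // closing quote
      else field += ch;
    } else if (ch === '"') {
      inQuotes = true;                                               // opening quote
    } else if (ch === delimiter) {
      fields.push(field);
      field = '';
    } else {
      field += ch;
    }
  }
  fields.push(field);
  return fields;
}
```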

Data validation during conversion

A robust converter validates data during conversion:

  • Row length consistency: Flag rows with more or fewer fields than the header
  • Required field checks: Ensure critical columns are not empty
  • Type validation: If a column should be numeric, flag non-numeric values
  • Duplicate detection: Identify duplicate rows or duplicate values in key columns

These checks are significantly easier to implement during the conversion step than as a separate pass.
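As a rough sketch of how the first three checks can ride along with conversion (the column names and rule shape here are illustrative, not from my actual pipeline):

```javascript
// Validate one parsed row against the header and simple rules,
// appending human-readable problems to an issues array.
function validateRow(headers, fields, rowNum, rules, issues) {
  // Row length consistency.
  if (fields.length !== headers.length) {
    issues.push(`row ${rowNum}: ${fields.length} fields, expected ${headers.length}`);
  }
  // Required field checks.
  for (const col of rules.required || []) {
    const i = headers.indexOf(col);
    if (i !== -1 && !fields[i]) issues.push(`row ${rowNum}: missing ${col}`);
  }
  // Type validation for numeric columns.
  for (const col of rules.numeric || []) {
    const i = headers.indexOf(col);
    if (i !== -1 && fields[i] && isNaN(Number(fields[i]))) {
      issues.push(`row ${rowNum}: non-numeric ${col}="${fields[i]}"`);
    }
  }
}
```

Duplicate detection needs state across rows (a `Set` of key-column values works for files that fit the keys in memory), so it lives outside the per-row function.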

JSON output formatting

For API consumption, compact JSON is appropriate. For human inspection, pretty-printed JSON with indentation is better. For line-by-line processing, NDJSON (newline-delimited JSON) is optimal -- one JSON object per line, no wrapping array:

{"name":"Alice","age":"30"}
{"name":"Bob","age":"25"}

NDJSON can be processed line-by-line without parsing the entire file, making it ideal for large datasets and streaming pipelines.
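Switching the earlier streaming converter to NDJSON output also removes the bracket and comma bookkeeping; each row becomes one self-contained line. A minimal sketch (the helper name is mine):

```javascript
// Turn one header/fields pair into a single NDJSON line.
function toNdjsonLine(headers, fields) {
  const obj = {};
  headers.forEach((h, i) => { obj[h] = fields[i] ?? ''; });
  return JSON.stringify(obj) + '\n';
}
```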

The converter tool

For quick, reliable CSV-to-JSON conversion that handles encoding detection, delimiter detection, and proper quoting, I built a CSV to JSON converter that runs in the browser with no file uploads to external servers. Drop in your CSV, configure the options, get clean JSON.


I'm Michael Lip. I build free developer tools at zovo.one. 500+ tools, all private, all free.
