A CSV Editor With RFC 4180 Parsing, Auto Delimiter Detection, and Markdown Export

#javascript #webdev #csv #data

Parsing CSV correctly means handling quoted fields, doubled-quote escapes, newlines embedded inside quotes, and trailing newlines. A parser that covers all of this is a small state machine, roughly 50 lines of JavaScript. Once you have that, exporting to Markdown, JSON, HTML, or TSV takes only a few more functions on top.

Everyone underestimates CSV. line.split(',') works for 90% of files and fails badly on the other 10%. The closest thing to a spec is RFC 4180, which allows a field to contain commas, quotes, and newlines as long as it is wrapped in double quotes, with a doubled quote ("") standing for one literal quote.
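
To make the failure concrete, here is a small illustration of what a naive split does to a quoted field. The sample record is made up for this example:

```javascript
// A made-up record with two fields: a quoted name containing a comma, and an age.
const line = '"Smith, John",42';

// Naive splitting treats the comma inside the quotes as a separator.
const naive = line.split(',');

console.log(naive.length); // 3, although the record has only 2 fields
console.log(naive);        // [ '"Smith', ' John"', '42' ]
```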

🔗 Live demo: https://sen.ltd/portfolio/csv-tool/
📦 GitHub: https://github.com/sen-ltd/csv-tool

Features:

RFC 4180 parser (handles quotes, escapes, embedded newlines)
Auto delimiter detection (comma, tab, semicolon, pipe)
Editable table with sort, search, pagination
Add / remove rows and columns
Export: CSV, TSV, JSON, Markdown, HTML
Per-column type detection (number / date / boolean / string)
Japanese / English UI
Zero dependencies, 76 tests

The state machine

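Conceptually, CSV parsing is a three-state machine: outside quotes, inside quotes, and inside quotes having just seen a quote. The implementation below folds the third state into a one-character lookahead (text[i + 1]), so a single inQuotes flag is enough. That lookahead is where doubled-quote escapes get resolved: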

export function parseCSV(text, delimiter = ',') {
  const rows = [];
  let row = [];
  let field = '';
  let inQuotes = false;
  let i = 0;

  while (i < text.length) {
    const c = text[i];

    if (inQuotes) {
      if (c === '"') {
        if (text[i + 1] === '"') {
          // Escaped quote
          field += '"';
          i += 2;
        } else {
          // End of quoted field
          inQuotes = false;
          i++;
        }
      } else {
        field += c;
        i++;
      }
    } else {
      if (c === '"') {
        inQuotes = true;
        i++;
      } else if (c === delimiter) {
        row.push(field);
        field = '';
        i++;
      } else if (c === '\n' || c === '\r') {
        row.push(field);
        rows.push(row);
        row = [];
        field = '';
        if (c === '\r' && text[i + 1] === '\n') i++;
        i++;
      } else {
        field += c;
        i++;
      }
    }
  }

  // Flush last field/row
  if (field || row.length > 0) {
    row.push(field);
    rows.push(row);
  }

  return rows;
}

The tricky cases:

"foo,bar" → single field foo,bar (comma inside quotes is literal)
"foo""bar" → single field foo"bar (doubled quote escapes)
"line1\nline2" → single field with embedded newline
Mixed \r\n and \n line endings → both work
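
Exporting back to CSV applies the same rules in reverse. The following is a minimal sketch, not the repository's code: toCSV and quoteField are hypothetical names, and using \r\n as the record separator is an assumption that follows RFC 4180's grammar.

```javascript
// Hypothetical helper: quote a field only when it needs it.
function quoteField(value, delimiter = ',') {
  const s = String(value ?? '');
  const needsQuotes =
    s.includes(delimiter) || s.includes('"') || s.includes('\n') || s.includes('\r');
  // Inside quotes, a literal quote is written as two quotes.
  return needsQuotes ? '"' + s.replace(/"/g, '""') + '"' : s;
}

// Hypothetical serializer: rows is an array of arrays of strings.
function toCSV(rows, delimiter = ',') {
  return rows
    .map(row => row.map(cell => quoteField(cell, delimiter)).join(delimiter))
    .join('\r\n');
}

const csv = toCSV([
  ['name', 'note'],
  ['foo,bar', 'say "hi"'],
  ['line1\nline2', ''],
]);
console.log(csv);
```

Feeding this output back through parseCSV should reproduce the original rows, which makes round-tripping a convenient test.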

Delimiter detection

If the user doesn't specify a delimiter, we guess by counting each candidate's occurrences per line over the first few lines and favoring the one whose count is the same on every line:

export function detectDelimiter(text) {
  const candidates = [',', '\t', ';', '|'];
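  // Sample the first five non-empty lines; a trailing blank line would otherwise break the consistency check.
  const lines = text.split(/\r?\n/).filter(l => l !== '').slice(0, 5);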
  let best = ',';
  let bestScore = -1;
  for (const delim of candidates) {
    const counts = lines.map(l => (l.match(new RegExp(delim === '\t' ? '\\t' : '\\' + delim, 'g')) || []).length);
    if (counts[0] === 0) continue;
    const consistent = counts.every(c => c === counts[0]);
    const score = counts[0] * (consistent ? 2 : 1);
    if (score > bestScore) { bestScore = score; best = delim; }
  }
  return best;
}

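Consistency is rewarded: a count that is the same on every sampled line doubles the score. "a,b,c" on one line and "d,e" on the next is suspicious, while "a,b,c" repeated on every line is a strong signal. Because the bonus is only a factor of two, a frequent but inconsistent character can still outscore a rare but consistent one.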
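
The regex count in detectDelimiter does not know about quoting, so a delimiter inside a quoted field is counted as well. One possible refinement is to skip quoted sections while counting. This is a sketch under assumptions: countOutsideQuotes is a hypothetical helper, not part of the tool, and it works on one line at a time, so it does not handle a quoted field that spans several lines.

```javascript
// Hypothetical helper: count delimiter characters that are not inside double quotes.
function countOutsideQuotes(line, delimiter) {
  let count = 0;
  let inQuotes = false;
  for (const ch of line) {
    if (ch === '"') {
      // A doubled quote toggles twice, which leaves the state unchanged.
      inQuotes = !inQuotes;
    } else if (ch === delimiter && !inQuotes) {
      count++;
    }
  }
  return count;
}

console.log(countOutsideQuotes('"Smith, John",42,NY', ',')); // 2
console.log(countOutsideQuotes('a;"b;c";d', ';'));           // 2
```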

Column type detection

Walk each column and check whether every non-empty value matches a type:

export function detectColumnType(values) {
  const nonEmpty = values.filter(v => v !== '' && v != null);
  if (nonEmpty.length === 0) return 'string';
  if (nonEmpty.every(isValidNumber)) return 'number';
  if (nonEmpty.every(isValidDate)) return 'date';
  if (nonEmpty.every(isValidBoolean)) return 'boolean';
  return 'string';
}

Order matters: check number before boolean, because "1" is a valid boolean but also a valid number, and "number" is the more useful classification.
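
The three predicates are not shown in this article. The versions below are assumptions made for illustration. The accepted boolean spellings and the ISO-only date check in particular are guesses, not the tool's actual rules. They are paired with a copy of detectColumnType so the snippet runs on its own:

```javascript
// Assumed predicates; the real tool may accept different formats.
const isValidNumber = v => v.trim() !== '' && Number.isFinite(Number(v));
// Accept only ISO-style dates (YYYY-MM-DD, optionally followed by a time).
const isValidDate = v => /^\d{4}-\d{2}-\d{2}/.test(v) && !Number.isNaN(Date.parse(v));
const isValidBoolean = v => ['true', 'false', 'yes', 'no', '1', '0'].includes(v.toLowerCase());

function detectColumnType(values) {
  const nonEmpty = values.filter(v => v !== '' && v != null);
  if (nonEmpty.length === 0) return 'string';
  if (nonEmpty.every(isValidNumber)) return 'number';
  if (nonEmpty.every(isValidDate)) return 'date';
  if (nonEmpty.every(isValidBoolean)) return 'boolean';
  return 'string';
}

console.log(detectColumnType(['1', '0', '1']));              // number (checked before boolean)
console.log(detectColumnType(['true', 'no', '']));           // boolean
console.log(detectColumnType(['2024-01-15', '2023-12-31'])); // date
console.log(detectColumnType(['12', 'n/a']));                // string
```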

Markdown table output

Markdown renderers ignore cell padding, but padding each column to a common width keeps the raw table readable:

export function toMarkdown(rows, hasHeader = true) {
  if (rows.length === 0) return '';
  const widths = rows[0].map((_, colIdx) => 
    Math.max(...rows.map(r => (r[colIdx] || '').length))
  );
  const lines = [];
  const formatRow = (r) => '| ' + r.map((cell, i) => (cell || '').padEnd(widths[i])).join(' | ') + ' |';
  lines.push(formatRow(rows[0]));
  if (hasHeader) {
    lines.push('|' + widths.map(w => '-'.repeat(w + 2)).join('|') + '|');
  }
  for (let i = 1; i < rows.length; i++) {
    lines.push(formatRow(rows[i]));
  }
  return lines.join('\n');
}
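
One gap worth noting: formatRow writes cells verbatim, so a cell that contains a pipe or a newline would break the table layout. A possible pre-processing step is sketched below. escapeMarkdownCell is a hypothetical helper, and replacing newlines with <br> assumes a renderer that allows inline HTML in table cells, as GitHub's does.

```javascript
// Hypothetical helper: make a cell value safe to place inside a Markdown table row.
function escapeMarkdownCell(cell) {
  return String(cell ?? '')
    .replace(/\|/g, '\\|')            // a bare pipe would start a new column
    .replace(/\r\n|\r|\n/g, '<br>');  // a raw newline would end the table row
}

const rows = [
  ['command', 'note'],
  ['cat a | sort', 'line1\nline2'],
];
const safeRows = rows.map(row => row.map(escapeMarkdownCell));

console.log(safeRows[1][0]); // cat a \| sort
console.log(safeRows[1][1]); // line1<br>line2
```

Escaping before the widths are computed (that is, passing safeRows to toMarkdown) keeps the padding consistent with what is actually written.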

Note the CJK caveat: padEnd counts UTF-16 code units, not display width. A full-width Japanese character usually occupies two cells in a monospace font, so the raw output may look misaligned in a terminal or editor even though the Markdown still renders correctly.
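
If alignment in the raw text matters, padding can be based on an estimated display width instead of string length. The sketch below is an approximation: isWide, displayWidth, and padEndDisplay are hypothetical helpers, and the code point ranges are a rough subset of the East Asian wide and fullwidth characters, not the full Unicode table. Combining marks and emoji are not handled.

```javascript
// Rough check for code points that usually render two cells wide.
function isWide(cp) {
  return (
    (cp >= 0x1100 && cp <= 0x115f) ||  // Hangul Jamo
    (cp >= 0x2e80 && cp <= 0xa4cf) ||  // CJK radicals, kana, CJK ideographs, Yi
    (cp >= 0xac00 && cp <= 0xd7a3) ||  // Hangul syllables
    (cp >= 0xf900 && cp <= 0xfaff) ||  // CJK compatibility ideographs
    (cp >= 0xff00 && cp <= 0xff60) ||  // Fullwidth forms
    (cp >= 0xffe0 && cp <= 0xffe6) ||  // Fullwidth signs
    (cp >= 0x20000 && cp <= 0x3fffd)   // Supplementary ideographic planes
  );
}

function displayWidth(str) {
  let width = 0;
  // for...of iterates by code point, so a surrogate pair is counted once.
  for (const ch of str) {
    width += isWide(ch.codePointAt(0)) ? 2 : 1;
  }
  return width;
}

// Like padEnd, but pads until the estimated display width reaches targetWidth.
function padEndDisplay(str, targetWidth) {
  const missing = Math.max(0, targetWidth - displayWidth(str));
  return str + ' '.repeat(missing);
}

console.log('日本語'.length);        // 3
console.log(displayWidth('日本語')); // 6
console.log(displayWidth('abc'));    // 3
```

In toMarkdown, this would mean computing widths with displayWidth and replacing the padEnd call with padEndDisplay.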