Arshav
How We Built a Deterministic File Import Pipeline in TypeScript (CSV, XLSX, ZIP)

Most file importers look good in a demo.

Production is different:

  • headers are inconsistent
  • users upload many files at once
  • ZIP files include random extra files
  • retries create duplicates
  • support gets flooded with “why did this fail?”

While building multischema.com, we made one rule non-negotiable:

Same input + same schema version = same output. Every time.

1) Determinism first

Our import flow behaves like a pure function:

  1. normalize + sort files in a stable order
  2. build a deterministic runKey
  3. reuse previous result for retries instead of reprocessing

```ts
const filesInOrder = files
  .map((f) => ({ path: f.path, size: f.size, sha256: f.sha256 }))
  .sort((a, b) => a.path.localeCompare(b.path));

const runKey = sha256(
  filesInOrder.map((f) => `${f.path}:${f.size}:${f.sha256}`).join("|") +
    `|schema:${schemaVersion}`
);

const existing = await getImportRunByKey(runKey);
if (existing) return existing.result;
```

This single change removed most duplicate-import chaos.
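The `sha256` string helper the snippet above relies on is assumed; with Node's built-in `crypto` module it can be a one-liner, sketched here:

```ts
import { createHash } from "node:crypto";

// Hex-encoded SHA-256 of a string (the helper the run-key snippet assumes).
function sha256(input: string): string {
  return createHash("sha256").update(input).digest("hex");
}

// Same input always yields the same key, which is what makes retries safe.
const key1 = sha256("a.csv:20:aaa|b.csv:10:bbb|schema:3");
const key2 = sha256("a.csv:20:aaa|b.csv:10:bbb|schema:3");
// key1 === key2, and each is a 64-character hex string
```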

2) ZIP support without UX pain

ZIP uploads are common, but they often contain:

  • hidden system files
  • screenshots and PDFs
  • unrelated exports

We don’t fail the whole run.
We process the valid files and return a skipped-file summary with reasons.

```ts
const supported = new Set([".csv", ".xlsx", ".xls"]);

for (const entry of zipEntries) {
  const ext = extname(entry.name).toLowerCase();

  if (!supported.has(ext)) {
    skipped.push({ file: entry.name, reason: "Unsupported file type" });
    continue;
  }

  accepted.push(entry);
}
```

User-facing result:

```
Imported: 4 files
Skipped: 3 files
Reasons: unsupported type, empty file, invalid format
```

A clear summary means fewer support tickets.
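A minimal sketch of how such a summary could be assembled from the accepted/skipped lists (the `summarize` helper is illustrative, not our production API):

```ts
type Skipped = { file: string; reason: string };

// Collapse per-file skip reasons into the user-facing
// "Imported / Skipped / Reasons" lines shown above.
function summarize(acceptedCount: number, skipped: Skipped[]): string {
  const reasons = [...new Set(skipped.map((s) => s.reason))].join(", ");
  return [
    `Imported: ${acceptedCount} files`,
    `Skipped: ${skipped.length} files`,
    ...(skipped.length > 0 ? [`Reasons: ${reasons}`] : []),
  ].join("\n");
}

const summary = summarize(4, [
  { file: "notes.pdf", reason: "unsupported type" },
  { file: "empty.csv", reason: "empty file" },
  { file: "broken.xlsx", reason: "invalid format" },
]);
```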

3) Schema mapping is a real pipeline step

Most failures happen before business logic, during column mapping.

We treat mapping as a first-class stage:

  1. parse the file
  2. normalize headers
  3. map to schema fields
  4. validate rows
  5. upsert records

Each error is structured: code, row, column, message.
Users can fix data quickly instead of guessing.
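A minimal sketch of the normalize-and-map step with structured errors (the field names, error code, and normalization rules here are illustrative assumptions):

```ts
type MappingError = { code: string; row: number; column: string; message: string };

// Normalize "Invoice Number", "invoice-number", "INVOICE_NUMBER" → "invoice_number".
const normalizeHeader = (h: string) =>
  h.trim().toLowerCase().replace(/[\s_-]+/g, "_");

// Hypothetical normalized-header → schema-field lookup.
const schemaFields: Record<string, string> = {
  invoice_number: "invoiceNumber",
  vendor_id: "vendorId",
  amount: "amount",
};

function mapHeaders(rawHeaders: string[]): {
  mapping: Record<number, string>;
  errors: MappingError[];
} {
  const mapping: Record<number, string> = {};
  const errors: MappingError[] = [];
  rawHeaders.forEach((raw, i) => {
    const field = schemaFields[normalizeHeader(raw)];
    if (field) {
      mapping[i] = field;
    } else {
      errors.push({
        code: "UNMAPPED_COLUMN",
        row: 0, // header row
        column: raw,
        message: `No schema field matches "${raw}"`,
      });
    }
  });
  return { mapping, errors };
}

const result = mapHeaders(["Invoice Number", "VENDOR-ID", "Total"]);
// → maps columns 0 and 1, reports one UNMAPPED_COLUMN error for "Total"
```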

4) Idempotent writes (not just idempotent processing)

Deterministic processing still fails if writes duplicate rows.

Use stable keys for upserts:

  • invoice_number + vendor_id
  • external_id
  • a deterministic record hash (only if no natural key exists)

```sql
INSERT INTO invoices (vendor_id, invoice_number, amount, due_date)
VALUES (?, ?, ?, ?)
ON CONFLICT (vendor_id, invoice_number)
DO UPDATE SET
  amount = excluded.amount,
  due_date = excluded.due_date,
  updated_at = CURRENT_TIMESTAMP;
```

No stable key = no reliable import.
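For the last-resort case, a deterministic record hash could look like this: hash the record's key-sorted JSON so the same row always produces the same upsert key regardless of field order. (A sketch; value-normalization rules such as trimming or case-folding are an assumption and would be domain-specific.)

```ts
import { createHash } from "node:crypto";

// Canonicalize by sorting keys before serializing, so {a,b} and {b,a}
// hash identically. Only use this when no natural key exists.
function recordHash(record: Record<string, unknown>): string {
  const canonical = JSON.stringify(
    Object.keys(record)
      .sort()
      .map((k) => [k, record[k]])
  );
  return createHash("sha256").update(canonical).digest("hex");
}
```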

5) UX quality matters as much as parser quality

Reliability is invisible if the UX is vague.

For every run we show:

  • accepted files
  • skipped files + reasons
  • row-level validation errors
  • inserted / updated / failed counts
Developers care about correctness.
Operators care about clarity.
You need both.
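As one concrete shape, the per-run report covering all four of those pieces could be typed roughly like this (field names are illustrative, not multischema.com's actual API):

```ts
// Everything an operator needs to answer "why did this fail?" in one object.
interface ImportRunResult {
  runKey: string;
  accepted: string[];
  skipped: { file: string; reason: string }[];
  rowErrors: { code: string; row: number; column: string; message: string }[];
  counts: { inserted: number; updated: number; failed: number };
}

const example: ImportRunResult = {
  runKey: "1f3a…",
  accepted: ["invoices.csv"],
  skipped: [{ file: "notes.pdf", reason: "unsupported type" }],
  rowErrors: [],
  counts: { inserted: 120, updated: 4, failed: 0 },
};
```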

Final takeaway

A production-grade importer is not “parse CSV and hope.”

It should be:

  • deterministic
  • idempotent
  • schema-aware
  • explicit about skipped/failed inputs
  • predictable across retries
If you’re building import flows, start with determinism first. Everything else gets easier after that.

I’m happy to share a follow-up on queue architecture (upload API, worker pipeline, retries, and backpressure) if there’s interest.