Arshav
How We Built a Deterministic File Import Pipeline in TypeScript (CSV, XLSX, ZIP)

Most file importers look good in a demo.

Production is different:

  • headers are inconsistent
  • users upload many files at once
  • ZIP files include random extra files
  • retries create duplicates
  • support gets flooded with “why did this fail?”

While building multischema.com, we made one rule non-negotiable:

Same input + same schema version = same output. Every time.

1) Determinism first

Our import flow behaves like a pure function:

  1. normalize + sort files in a stable order
  2. build a deterministic runKey
  3. reuse previous result for retries instead of reprocessing

```ts
const filesInOrder = files
  .map((f) => ({ path: f.path, size: f.size, sha256: f.sha256 }))
  .sort((a, b) => a.path.localeCompare(b.path));

const runKey = sha256(
  filesInOrder.map((f) => `${f.path}:${f.size}:${f.sha256}`).join("|") +
    `|schema:${schemaVersion}`
);

const existing = await getImportRunByKey(runKey);
if (existing) return existing.result;
```

This single change removed most duplicate-import chaos.
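The `sha256` string helper the snippet above relies on is assumed; with Node's built-in `crypto` module it can be a one-liner, sketched here:

```ts
import { createHash } from "node:crypto";

// Hex-encoded SHA-256 of a string (the helper the run-key snippet assumes).
function sha256(input: string): string {
  return createHash("sha256").update(input).digest("hex");
}

// Same input always yields the same key, which is what makes retries safe.
const key1 = sha256("a.csv:20:aaa|b.csv:10:bbb|schema:3");
const key2 = sha256("a.csv:20:aaa|b.csv:10:bbb|schema:3");
// key1 === key2, and each is a 64-character hex string
```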

2) ZIP support without UX pain

ZIP uploads are common, but they often contain:

  • hidden system files
  • screenshots and PDFs
  • unrelated exports

We don’t fail the whole run.
We process the valid files and return a skipped-file summary with reasons.

```ts
const supported = new Set([".csv", ".xlsx", ".xls"]);

for (const entry of zipEntries) {
  const ext = extname(entry.name).toLowerCase();

  if (!supported.has(ext)) {
    skipped.push({ file: entry.name, reason: "Unsupported file type" });
    continue;
  }

  accepted.push(entry);
}
```

User-facing result:

```
Imported: 4 files
Skipped: 3 files
Reasons: unsupported type, empty file, invalid format
```

A clear summary means fewer support tickets.
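A minimal sketch of how such a summary could be assembled from the accepted/skipped lists (the `summarize` helper is illustrative, not our production API):

```ts
type Skipped = { file: string; reason: string };

// Collapse per-file skip reasons into the user-facing
// "Imported / Skipped / Reasons" lines shown above.
function summarize(acceptedCount: number, skipped: Skipped[]): string {
  const reasons = [...new Set(skipped.map((s) => s.reason))].join(", ");
  return [
    `Imported: ${acceptedCount} files`,
    `Skipped: ${skipped.length} files`,
    ...(skipped.length > 0 ? [`Reasons: ${reasons}`] : []),
  ].join("\n");
}

const summary = summarize(4, [
  { file: "notes.pdf", reason: "unsupported type" },
  { file: "empty.csv", reason: "empty file" },
  { file: "broken.xlsx", reason: "invalid format" },
]);
```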

3) Schema mapping is a real pipeline step

Most failures happen before business logic, during column mapping.

We treat mapping as a first-class stage:

  1. parse the file
  2. normalize headers
  3. map to schema fields
  4. validate rows
  5. upsert records

Each error is structured: code, row, column, message.
Users can fix data quickly instead of guessing.
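A minimal sketch of the normalize-and-map step with structured errors (the field names, error code, and normalization rules here are illustrative assumptions):

```ts
type MappingError = { code: string; row: number; column: string; message: string };

// Normalize "Invoice Number", "invoice-number", "INVOICE_NUMBER" → "invoice_number".
const normalizeHeader = (h: string) =>
  h.trim().toLowerCase().replace(/[\s_-]+/g, "_");

// Hypothetical normalized-header → schema-field lookup.
const schemaFields: Record<string, string> = {
  invoice_number: "invoiceNumber",
  vendor_id: "vendorId",
  amount: "amount",
};

function mapHeaders(rawHeaders: string[]): {
  mapping: Record<number, string>;
  errors: MappingError[];
} {
  const mapping: Record<number, string> = {};
  const errors: MappingError[] = [];
  rawHeaders.forEach((raw, i) => {
    const field = schemaFields[normalizeHeader(raw)];
    if (field) {
      mapping[i] = field;
    } else {
      errors.push({
        code: "UNMAPPED_COLUMN",
        row: 0, // header row
        column: raw,
        message: `No schema field matches "${raw}"`,
      });
    }
  });
  return { mapping, errors };
}

const result = mapHeaders(["Invoice Number", "VENDOR-ID", "Total"]);
// → maps columns 0 and 1, reports one UNMAPPED_COLUMN error for "Total"
```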

4) Idempotent writes (not just idempotent processing)

Deterministic processing still fails if writes duplicate rows.

Use stable keys for upserts:

  • invoice_number + vendor_id
  • external_id
  • a deterministic record hash (only if no natural key exists)

```sql
INSERT INTO invoices (vendor_id, invoice_number, amount, due_date)
VALUES (?, ?, ?, ?)
ON CONFLICT (vendor_id, invoice_number)
DO UPDATE SET
  amount = excluded.amount,
  due_date = excluded.due_date,
  updated_at = CURRENT_TIMESTAMP;
```

No stable key = no reliable import.
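For the last-resort case, a deterministic record hash could look like this: hash the record's key-sorted JSON so the same row always produces the same upsert key regardless of field order. (A sketch; value-normalization rules such as trimming or case-folding are an assumption and would be domain-specific.)

```ts
import { createHash } from "node:crypto";

// Canonicalize by sorting keys before serializing, so {a,b} and {b,a}
// hash identically. Only use this when no natural key exists.
function recordHash(record: Record<string, unknown>): string {
  const canonical = JSON.stringify(
    Object.keys(record)
      .sort()
      .map((k) => [k, record[k]])
  );
  return createHash("sha256").update(canonical).digest("hex");
}
```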

5) UX quality matters as much as parser quality

Reliability is invisible if the UX is vague.

For every run we show:

  • accepted files
  • skipped files + reasons
  • row-level validation errors
  • inserted / updated / failed counts
Developers care about correctness.
Operators care about clarity.
You need both.
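As one concrete shape, the per-run report covering all four of those pieces could be typed roughly like this (field names are illustrative, not multischema.com's actual API):

```ts
// Everything an operator needs to answer "why did this fail?" in one object.
interface ImportRunResult {
  runKey: string;
  accepted: string[];
  skipped: { file: string; reason: string }[];
  rowErrors: { code: string; row: number; column: string; message: string }[];
  counts: { inserted: number; updated: number; failed: number };
}

const example: ImportRunResult = {
  runKey: "1f3a…",
  accepted: ["invoices.csv"],
  skipped: [{ file: "notes.pdf", reason: "unsupported type" }],
  rowErrors: [],
  counts: { inserted: 120, updated: 4, failed: 0 },
};
```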

Final takeaway

A production-grade importer is not “parse CSV and hope.”

It should be:

  • deterministic
  • idempotent
  • schema-aware
  • explicit about skipped/failed inputs
  • predictable across retries
If you’re building import flows, start with determinism first. Everything else gets easier after that.

I’m happy to share a follow-up on queue architecture (upload API, worker pipeline, retries, and backpressure) if there’s interest.