How We Built a Deterministic File Import Pipeline in TypeScript (CSV, XLSX, ZIP)
Most file importers look good in a demo.
Production is different:
- headers are inconsistent
- users upload many files at once
- ZIP files include random extra files
- retries create duplicates
- support gets flooded with “why did this fail?”
While building multischema.com, we made one rule non-negotiable:
Same input + same schema version = same output. Every time.
1) Determinism first
Our import flow behaves like a pure function:
- normalize + sort files in a stable order
- build a deterministic runKey from file hashes and the schema version
- reuse the previous result for retries instead of reprocessing
```ts
const filesInOrder = files
  .map((f) => ({ path: f.path, size: f.size, sha256: f.sha256 }))
  .sort((a, b) => a.path.localeCompare(b.path));

const runKey = sha256(
  filesInOrder.map((f) => `${f.path}:${f.size}:${f.sha256}`).join("|") +
    `|schema:${schemaVersion}`
);

const existing = await getImportRunByKey(runKey);
if (existing) return existing.result;
```
This single change removed most duplicate-import chaos.
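The sha256 helper above is left undefined in the snippet; a minimal Node.js sketch using the built-in crypto module (the helper name and signature are assumptions, not the article's exact implementation) could look like:

```typescript
import { createHash } from "node:crypto";

// Hash an arbitrary string to a lowercase hex SHA-256 digest.
// Used both for per-file content hashes and for the combined runKey.
function sha256(input: string): string {
  return createHash("sha256").update(input, "utf8").digest("hex");
}
```

Because the input string is built from a sorted file list plus the schema version, the same upload always produces the same runKey, which is what makes the retry lookup safe.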
2) ZIP support without UX pain
ZIP uploads are common, but they often contain:
- hidden system files
- screenshots/PDFs
- unrelated exports
We don’t fail the whole run.
We process valid files and return a skipped-file summary with reasons.
```ts
const supported = new Set([".csv", ".xlsx", ".xls"]);

for (const entry of zipEntries) {
  const ext = extname(entry.name).toLowerCase();
  if (!supported.has(ext)) {
    skipped.push({ file: entry.name, reason: "Unsupported file type" });
    continue;
  }
  accepted.push(entry);
}
```
User-facing result:
```
Imported: 4 files
Skipped: 3 files
Reasons: unsupported type, empty file, invalid format
```
Clear summary = fewer support tickets.
3) Schema mapping is a real pipeline step
Most failures happen before business logic, during column mapping.
We treat mapping as a first-class stage:
1. parse file
2. normalize headers
3. map to schema fields
4. validate rows
5. upsert records
Each error is structured: code, row, column, message.
Users can fix data quickly instead of guessing.
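As a rough sketch of what "structured" means here, header normalization and the error shape might look like the following (type and function names are illustrative, not the article's exact code):

```typescript
// Hypothetical error shape for the mapping stage: every failure carries
// a machine-readable code plus enough location info to fix the data.
interface RowError {
  code: string;     // e.g. "REQUIRED_FIELD_MISSING"
  row: number;      // 1-based row in the source file
  column: string;   // original column header as uploaded
  message: string;  // human-readable explanation
}

// Normalize a raw header so "Invoice  Number", "invoice_number" and
// "INVOICE-NUMBER" all resolve to the same canonical schema key.
function normalizeHeader(header: string): string {
  return header.trim().toLowerCase().replace(/[\s-]+/g, "_");
}
```

With a canonical key per header, the mapping stage can match columns to schema fields deterministically instead of relying on exact string equality.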
4) Idempotent writes (not just idempotent processing)
Deterministic processing still fails if writes duplicate rows.
Use stable keys for upsert:
invoice_number + vendor_id
external_id
deterministic record hash (only if no natural key exists)
```sql
INSERT INTO invoices (vendor_id, invoice_number, amount, due_date)
VALUES (?, ?, ?, ?)
ON CONFLICT (vendor_id, invoice_number)
DO UPDATE SET
  amount = excluded.amount,
  due_date = excluded.due_date,
  updated_at = CURRENT_TIMESTAMP;
```
No stable key = no reliable import.
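For the fallback case, a deterministic record hash could be sketched like this (assuming Node.js crypto; the function name is hypothetical): the record's fields are serialized in sorted key order so that insertion order never changes the hash.

```typescript
import { createHash } from "node:crypto";

// Fallback upsert key when no natural key exists: hash a canonical,
// order-independent serialization of the record's fields.
function recordHash(record: Record<string, unknown>): string {
  const canonical = Object.keys(record)
    .sort() // stable field order regardless of how the object was built
    .map((k) => `${k}=${JSON.stringify(record[k])}`)
    .join("|");
  return createHash("sha256").update(canonical, "utf8").digest("hex");
}
```

Two records with the same field values hash identically even if the parser emitted their keys in a different order, so retries upsert instead of duplicating.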
5) UX quality matters as much as parser quality
Reliability is invisible if UX is vague.
For every run we show:
- accepted files
- skipped files + reasons
- row-level validation errors
- inserted / updated / failed counts
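One possible shape for that per-run summary (field names are illustrative, not the article's actual API):

```typescript
// Everything the UI needs to explain a run: what was processed,
// what was skipped and why, and what happened at the row level.
interface ImportRunSummary {
  accepted: string[];                         // file paths that were processed
  skipped: { file: string; reason: string }[];
  rowErrors: { code: string; row: number; column: string; message: string }[];
  counts: { inserted: number; updated: number; failed: number };
}

const summary: ImportRunSummary = {
  accepted: ["invoices.csv"],
  skipped: [{ file: "notes.pdf", reason: "Unsupported file type" }],
  rowErrors: [],
  counts: { inserted: 10, updated: 2, failed: 0 },
};
```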
Developers care about correctness.
Operators care about clarity.
You need both.
Final takeaway
A production-grade importer is not “parse CSV and hope.”
It should be:
- deterministic
- idempotent
- schema-aware
- explicit about skipped/failed inputs
- predictable across retries
If you’re building import flows, start with determinism first. Everything else gets easier after that.
I’m happy to share a follow-up on queue architecture (upload API, worker pipeline, retries, and backpressure) if there’s interest.