Hammad Khan

Designing Idempotent Bulk Import Pipelines (E.164, VIN, and the Rest)

Bulk imports are a special category of pain. You give the user a CSV uploader, they hand you a file, and your job is to take that file from "blob of text" to "thousands of clean rows in your database." Somewhere in the middle, every assumption you have about your data falls apart.

I rebuilt the bulk import system for a dealership SaaS recently. The redesign covered phone number normalization, VIN format validation, file-format consistency, and the structural decision that makes the whole thing reliable: idempotency.

If you've ever shipped a bulk importer, you've probably hit the same problems. Here's the shape of the design that finally worked.

The four problems

Any bulk import system has to solve four problems:

  1. Validation. Every row has to satisfy schema constraints before it goes into the database. Some constraints are obvious (required fields), some are domain-specific (a VIN must be 17 characters in a specific alphabet).
  2. Normalization. "01234 567 890", "+44 1234 567890", and "(01234) 567890" might all need to become +441234567890 before they hit the database. The user doesn't think about this; you have to.
  3. Error reporting. When 200 of 5,000 rows fail validation, the user needs to know which 200 and why. A flat "import failed" tells them nothing.
  4. Idempotency. When the upload crashes halfway through and the user clicks "Retry," the second run must not double-write the first 2,500 rows.

The fourth problem is the one teams skip. It's also the one that breaks production. Let me start there.

Idempotency: the key decision

The pattern that works is stable per-row keys with an upsert at the DB level.

For each row, compute a stable key from the row's natural identity. For dealership customers, that might be (dealer_id, email_lowercase). For vehicles, (dealer_id, vin). Whatever uniquely identifies the row in the customer's mental model.

function customerKey(row: CustomerRow, dealerId: string): string {
  return `${dealerId}:${row.email.trim().toLowerCase()}`
}

Then use that key as the basis for an upsert:

INSERT INTO customers (dealer_id, email, name, phone, ...)
VALUES ($1, $2, $3, $4, ...)
ON CONFLICT (dealer_id, email) DO UPDATE SET
  name = EXCLUDED.name,
  phone = EXCLUDED.phone,
  ...
  updated_at = NOW();

Now if the import crashes after row 2,500 and the user retries, the first 2,500 rows update themselves in-place (no-op if the data didn't change) and the remaining 2,500 get inserted. Same end state as if the import had run cleanly the first time.

This sounds obvious. Most bulk-import code I've seen doesn't do it: it uses a plain INSERT INTO ... VALUES (...) and crashes on a unique-constraint violation the moment the user retries.
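To make the retry behavior concrete, here's a minimal sketch with node-postgres (the table shape matches the SQL above; the values are illustrative):

import { Pool } from 'pg'

const pool = new Pool()

// Naive insert: a retry throws error 23505 (unique_violation)
// on every row that already landed before the crash.
await pool.query(
  `INSERT INTO customers (dealer_id, email, name) VALUES ($1, $2, $3)`,
  ['dealer_1', 'jane@example.com', 'Jane'],
)

// Idempotent upsert: running it a second time is a harmless
// in-place update, so the retry converges on the same end state.
await pool.query(
  `INSERT INTO customers (dealer_id, email, name)
   VALUES ($1, $2, $3)
   ON CONFLICT (dealer_id, email) DO UPDATE SET
     name = EXCLUDED.name,
     updated_at = NOW()`,
  ['dealer_1', 'jane@example.com', 'Jane'],
)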

Normalize once, at the entry point

The normalization rule: do it once, as close to the file parsing as possible. Don't sprinkle normalization throughout the codebase.

// import-normalize.ts
import { parsePhoneNumberFromString } from 'libphonenumber-js'
import type { CountryCode } from 'libphonenumber-js'

export function normalizePhone(raw: string, defaultCountry: string): string | null {
  if (!raw || !raw.trim()) return null
  // defaultCountry lets the library parse local numbers typed without a
  // leading "+" (e.g. "01234 567890" with country "GB"); numbers that
  // already start with "+" are parsed as-is.
  const parsed = parsePhoneNumberFromString(raw.trim(), defaultCountry as CountryCode)
  if (!parsed?.isValid()) return null
  return parsed.format('E.164')  // e.g. "+441234567890"
}

export function normalizeVin(raw: string): string | null {
  if (!raw) return null
  const cleaned = raw.trim().toUpperCase().replace(/[^A-HJ-NPR-Z0-9]/g, '')
  // VIN must be 17 chars; no I, O, or Q allowed
  if (cleaned.length !== 17) return null
  return cleaned
}

export function normalizeEmail(raw: string): string | null {
  if (!raw) return null
  const trimmed = raw.trim().toLowerCase()
  return /^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(trimmed) ? trimmed : null
}

Three things in here that look small:

  • Each function returns null on invalid input, not a thrown exception. The caller decides what to do with invalid rows (skip them, surface an error). Throwing inside normalization makes the loop harder to reason about.
  • Phone normalization uses libphonenumber-js, not a regex. Phone numbers are too varied for regex to handle correctly. The library handles country codes, formatting, and valid-number checks, and it interprets local numbers typed without a "+" via the defaultCountry argument.
  • VIN normalization knows about the VIN alphabet. The letters I, O, and Q are not valid VIN characters (they look too much like 1 and 0). Strip them out during cleanup, and reject if the result isn't 17 characters.
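A quick sanity check of the three functions (the VIN and email results follow directly from the code above; the phone result assumes libphonenumber-js accepts this reserved-for-fiction UK number):

normalizeVin(' 1hgcm82633a004352 ')        // → '1HGCM82633A004352'
normalizeVin('1HGCM82633AOO4352')          // → null (O is outside the VIN alphabet, so cleanup leaves 15 chars)
normalizeEmail('  Jane.Doe@Example.COM ')  // → 'jane.doe@example.com'
normalizePhone('020 7946 0018', 'GB')      // → '+442079460018'
normalizePhone('not a phone', 'GB')        // → null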

The validation pipeline

The pipeline as a whole looks like this:

type ImportResult = {
  validRows: Customer[]
  errors: Array<{ row: number; field: string; reason: string }>
}

async function processImport(file: Buffer, dealerId: string, dealerCountry: string): Promise<ImportResult> {
  const parsed = parseCsv(file)
  const errors: ImportResult['errors'] = []
  const validRows: Customer[] = []

  for (let i = 0; i < parsed.length; i++) {
    const row = parsed[i]
    const rowNum = i + 2  // 1-indexed, plus header

    const email = normalizeEmail(row.email)
    if (!email) {
      errors.push({ row: rowNum, field: 'email', reason: 'invalid email' })
      continue
    }

    const phone = normalizePhone(row.phone, dealerCountry)
    if (row.phone && !phone) {
      errors.push({ row: rowNum, field: 'phone', reason: 'invalid phone' })
      continue
    }

    const name = (row.name ?? '').trim()
    if (!name) {
      errors.push({ row: rowNum, field: 'name', reason: 'required' })
      continue
    }

    validRows.push({ dealerId, email, phone, name })
  }

  return { validRows, errors }
}

A few choices to point out:

  • continue on validation error, don't return. Process every row, accumulate every error. The user wants to see all 200 errors at once, not the first one.
  • Row numbers are 1-indexed plus the header. The user's spreadsheet starts at row 1 (header) and the first data row is row 2. Match the user's mental model.
  • Errors capture row, field, and reason. This is what gets surfaced back to the UI. "Row 47: phone — invalid phone" is what the user needs to fix the spreadsheet.

The atomic write

Now you've got validRows. The write step is a single transactional upsert:

async function writeRows(rows: Customer[]) {
  if (rows.length === 0) return { inserted: 0, updated: 0 }

  return await db.transaction(async (tx) => {
    const result = await tx
      .insertInto('customers')
      .values(rows)
      .onConflict((oc) => oc.columns(['dealer_id', 'email']).doUpdateSet({
        name: (eb) => eb.ref('excluded.name'),
        phone: (eb) => eb.ref('excluded.phone'),
        updated_at: () => sql`now()`,
      }))
      .returning(['id', 'xmax'])  // xmax = 0 means insert, otherwise update
      .execute()

    const inserted = result.filter(r => r.xmax === '0').length
    const updated = result.length - inserted
    return { inserted, updated }
  })
}

Three things:

  • Single transaction. Either all 4,800 valid rows make it in, or none do. No partial state for the user to discover later.
  • xmax trick to distinguish inserts from updates. Postgres-specific: when INSERT ... ON CONFLICT DO UPDATE performs an insert, the returning row's xmax is 0. On an update, xmax is non-zero. This lets you tell the user "inserted 3,200, updated 1,600."
  • updated_at on conflict. Always touch the timestamp, even if no other fields changed. The audit trail wants to know the row was re-imported.

For very large imports (50,000+ rows), this single transaction will become a problem (long-held locks, replication lag). The fix is to batch:

const BATCH_SIZE = 1000
for (let i = 0; i < rows.length; i += BATCH_SIZE) {
  const batch = rows.slice(i, i + BATCH_SIZE)
  await writeRows(batch)
}

Each batch is its own transaction. You lose all-or-nothing across batches, but you gain the ability to scale to large imports. For our dealership use case (rare imports of 5,000-50,000 rows), this trade was right.
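One caveat that applies whichever transaction strategy you pick: if the same natural key appears twice in a single uploaded file, Postgres rejects the multi-row upsert with "ON CONFLICT DO UPDATE command cannot affect row a second time". A minimal dedupe pass before writeRows avoids it (keyed the same way as the upsert; keeping the last occurrence is one policy choice, not the only one):

// Collapse duplicate keys within one file so a single multi-row
// INSERT ... ON CONFLICT never touches the same row twice.
function dedupeByKey(rows: Customer[]): Customer[] {
  const byKey = new Map<string, Customer>()
  for (const row of rows) {
    byKey.set(`${row.dealerId}:${row.email}`, row)  // later rows win
  }
  return [...byKey.values()]
}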

What about the "supply-chain" failures?

The original code we replaced loaded its xlsx parser from a CDN. We had a separate write-up on why that's a bad idea (and a separate fix). For the bulk-import context specifically: the parser runs against user-uploaded files. If the parser is malicious, every uploaded file is exfiltrated. Vendor the parser, verify its hash in CI, and don't load it from a remote URL. (Details in the companion article on the xlsx CDN supply-chain hole.)
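As a sketch of the "verify its hash in CI" part (the file path and pinned hash here are hypothetical, and this is a plain Node script rather than the companion article's exact fix):

// scripts/verify-vendored-parser.ts — fail CI if the vendored parser changed.
import { createHash } from 'node:crypto'
import { readFileSync } from 'node:fs'

const PINNED_SHA256 = 'replace-with-hash-recorded-at-vendoring-time'
const actual = createHash('sha256')
  .update(readFileSync('vendor/xlsx-parser.min.js'))
  .digest('hex')

if (actual !== PINNED_SHA256) {
  console.error(`vendored parser hash mismatch: got ${actual}`)
  process.exit(1)
}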

The user-facing report

The validation step returns errors. The UI needs to show them in a way the user can act on. Two patterns work:

  1. Inline error report. After upload, show "5,200 rows imported, 200 errors. Download error report?" The download is a CSV with the user's original rows plus an error column. They fix the spreadsheet and re-upload.
  2. Reject-on-any-error. If any row has an error, reject the whole upload, surface the errors, and don't write anything. This is right for cases where partial imports are dangerous (vehicle records that have FK dependencies on other tables, for example).

For our customer imports, option 1 was right. For our vehicle imports, option 2 was right. Pick deliberately. Document the choice. Don't let the default be "import what we can, drop the rest silently" — that's how dealers find missing customers six months later.
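For pattern 1, the error report is just the user's failing rows with one extra column. A minimal sketch, reusing the row numbering from processImport (real code also needs proper CSV quoting):

function buildErrorReport(
  originalRows: Record<string, string>[],
  errors: ImportResult['errors'],
): string {
  // One error per row in this pipeline, because validation short-circuits
  // with `continue` on the first failing field.
  const byRow = new Map<number, string>()
  for (const e of errors) byRow.set(e.row, `${e.field}: ${e.reason}`)

  const header = [...Object.keys(originalRows[0] ?? {}), 'error'].join(',')
  const lines = originalRows.flatMap((row, i) => {
    const rowNum = i + 2  // same 1-indexed-plus-header convention as the pipeline
    const error = byRow.get(rowNum)
    return error ? [[...Object.values(row), error].join(',')] : []
  })
  return [header, ...lines].join('\n')
}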

The full pipeline shape

To summarize the design:

  1. Parse the file into raw rows. No transformation here.
  2. Normalize each field (phone, email, VIN, etc.) to the canonical form.
  3. Validate each row. Collect errors. Don't stop on the first failure.
  4. Decide based on the policy: reject-on-error or skip-bad-rows-and-continue.
  5. Upsert valid rows in a transaction (or batched transactions), using a stable natural key.
  6. Report back to the user: counts, errors, downloadable diff.

Each step is a small, testable function. The pipeline as a whole is reproducible: same file, same dealer config, same result. Crashes mid-way are recoverable because of idempotency.
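Wired together, the orchestration is only a few lines. A sketch using the functions above (the early return is the reject-on-any-error policy from step 4):

async function runImport(file: Buffer, dealerId: string, dealerCountry: string) {
  const { validRows, errors } = await processImport(file, dealerId, dealerCountry)

  // Vehicle-style policy: any error rejects the whole upload.
  if (errors.length > 0) return { inserted: 0, updated: 0, errors }

  // Customer-style policy would skip the check above and import validRows as-is.
  const { inserted, updated } = await writeRows(dedupeByKey(validRows))
  return { inserted, updated, errors }
}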

The takeaway

Bulk imports become reliable once you commit to two ideas: normalize at the boundary, and make every write idempotent via a stable key. Once those two are in place, retries are free, partial failures are recoverable, and "the import broke" stops being a thing your support team has to handle.

The hardest part of building one isn't the validation. It's having the conversation about what your stable key actually is — and being willing to make the user pick one when the data doesn't have an obvious candidate.
