Looplylabs
What Most CSV Ingestion Scripts Get Wrong (And How to Fix It)

Most CSV ingestion scripts are written in 30 minutes.

Most ingestion failures take 3 months to notice.

The problem isn’t CSV.

The problem is missing guarantees.

In small teams, CSV ingestion often looks like this:

  • Read file
  • Loop rows
  • Insert into database
  • Print “Done”

It works.

Until the export format changes.
Until the file is empty.
Until duplicates accumulate.
Until a partial insert corrupts reporting.

Here’s what most ingestion scripts get wrong.


1. They Don’t Validate Structure Explicitly

Many scripts assume the column order never changes.

That assumption eventually breaks.

Instead of trusting positional mapping, validate headers explicitly:

EXPECTED_HEADERS = [
    "date",
    "customer_id",
    "amount",
    "currency",
    "status"
]

if headers != EXPECTED_HEADERS:
    raise ValueError("Schema mismatch detected")

Order-sensitive comparison is intentional.

If upstream changes, ingestion should stop immediately.

Silent drift is worse than a crash.
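In Python, the header row can be pulled with the standard csv module before any data row is processed. A minimal sketch, using an in-memory file to stand in for the real export:

```python
import csv
import io

EXPECTED_HEADERS = ["date", "customer_id", "amount", "currency", "status"]

def validate_headers(f):
    """Read the first row and fail fast on any schema drift."""
    reader = csv.reader(f)
    headers = next(reader, None)
    if headers != EXPECTED_HEADERS:
        raise ValueError(f"Schema mismatch detected: {headers}")
    return reader  # positioned at the first data row

# io.StringIO stands in for the real export file here
sample = io.StringIO(
    "date,customer_id,amount,currency,status\n"
    "2024-01-01,42,9.99,USD,paid\n"
)
rows = list(validate_headers(sample))
```

Order-sensitive comparison carries over unchanged: a reordered export raises immediately instead of silently shifting values into the wrong columns.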


2. They Don’t Sanity-Check Volume

An empty CSV import should not succeed.

A report with 12 rows instead of 1,200 should not quietly pass.

Add simple guardrails:

if len(rows) == 0:
    raise RuntimeError("Empty export detected")

if len(rows) < expected_minimum:
    raise RuntimeError("Suspiciously low record count")

Most data corruption is not catastrophic.

It’s gradual.
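One way to make `expected_minimum` less arbitrary is to derive it from recent run history. A sketch, assuming the pipeline records row counts from previous successful runs (the 50% floor is an arbitrary threshold, not a recommendation):

```python
def expected_minimum(recent_counts, floor_ratio=0.5):
    """Derive a floor from the average of recent successful runs.

    `recent_counts` is assumed to hold row counts from prior runs;
    `floor_ratio` is a hypothetical tuning knob.
    """
    if not recent_counts:
        return 1  # with no history, only reject empty files
    return int(sum(recent_counts) / len(recent_counts) * floor_ratio)

threshold = expected_minimum([1180, 1210, 1195])
```

A 12-row file against a 1,200-row baseline now fails loudly instead of quietly passing.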


3. They Allow Partial Inserts

Without transactions, one failed row can leave the database in an inconsistent state.

Use transactional boundaries:

import psycopg2

conn = psycopg2.connect(dsn)  # connection details elided

# `with conn:` commits on success and rolls back
# automatically if any insert raises
with conn:
    with conn.cursor() as cur:
        for row in rows:
            cur.execute(
                "INSERT INTO reports VALUES (%s, %s, %s, %s, %s)",
                row
            )

If one insert fails, the entire operation rolls back.

No half-written states.


4. They Ignore Idempotency

If a job runs twice, does it duplicate records?

If yes, retries become dangerous.

Use unique constraints and upserts:

INSERT INTO reports (id, amount)
VALUES (%s, %s)
ON CONFLICT (id)
DO UPDATE SET amount = EXCLUDED.amount;


Idempotency turns retries from risk into safety.
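The property is easy to exercise locally. A sketch using sqlite3, which supports the same `ON CONFLICT` syntax as PostgreSQL (since SQLite 3.24), standing in for the real database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reports (id INTEGER PRIMARY KEY, amount REAL)")

def upsert(conn, rows):
    """Insert or update by id — safe to run any number of times."""
    with conn:  # one transaction per batch
        conn.executemany(
            "INSERT INTO reports (id, amount) VALUES (?, ?) "
            "ON CONFLICT (id) DO UPDATE SET amount = excluded.amount",
            rows,
        )

# Running the same batch twice leaves exactly one row per id
upsert(conn, [(1, 9.99), (2, 19.99)])
upsert(conn, [(1, 9.99), (2, 19.99)])
count = conn.execute("SELECT COUNT(*) FROM reports").fetchone()[0]
```

Here `count` is 2, not 4: the retry updated existing rows instead of duplicating them.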


5. They Depend on Humans

If someone manually uploads a CSV every week, the workflow depends on memory.

Replace the ritual with scheduling:

0 2 * * 1 /usr/bin/python3 ingest_reports.py >> ingestion.log 2>&1

Humans forget.

Cron does not.


The Underlying Principle

Automation is not about execution.

It’s about deterministic state transitions.

A safe ingestion pipeline guarantees:

  • Structural integrity
  • Volume sanity
  • Atomic writes
  • Safe retries

Everything else is optimism.
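Put together, the four guarantees fit in a surprisingly small pipeline. A sketch with sqlite3 standing in for PostgreSQL and a two-column schema for brevity:

```python
import csv
import io
import sqlite3

EXPECTED_HEADERS = ["id", "amount"]

def ingest(conn, f, expected_minimum=1):
    reader = csv.reader(f)
    # 1. Structural integrity: fail fast on schema drift
    if next(reader, None) != EXPECTED_HEADERS:
        raise ValueError("Schema mismatch detected")
    rows = list(reader)
    # 2. Volume sanity: reject empty or suspiciously small exports
    if len(rows) < expected_minimum:
        raise RuntimeError("Suspiciously low record count")
    # 3 + 4. Atomic writes and safe retries: one transaction, upserts
    with conn:
        conn.executemany(
            "INSERT INTO reports (id, amount) VALUES (?, ?) "
            "ON CONFLICT (id) DO UPDATE SET amount = excluded.amount",
            rows,
        )
    return len(rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reports (id INTEGER PRIMARY KEY, amount REAL)")
n = ingest(conn, io.StringIO("id,amount\n1,9.99\n2,19.99\n"))
```

Any failure at step 1 or 2 stops before a single write; any failure inside step 3 rolls the whole batch back; a rerun of the same file changes nothing.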

Further Reading

I wrote a deeper breakdown of deterministic ingestion architecture — including file archival, observability, and production safeguards — here:

Automating CSV to PostgreSQL Safely Using Python (Determini… | Looplylabs Blog
