Most CSV ingestion scripts are written in 30 minutes.
Most ingestion failures take 3 months to notice.
The problem isn’t CSV.
The problem is missing guarantees.
In small teams, CSV ingestion often looks like this:
- Read file
- Loop rows
- Insert into database
- Print “Done”
It works.
Until the export format changes.
Until the file is empty.
Until duplicates accumulate.
Until a partial insert corrupts reporting.
Here’s what most ingestion scripts get wrong.
1. They Don’t Validate Structure Explicitly
Many scripts assume the column order never changes.
That assumption eventually breaks.
Instead of trusting positional mapping, validate headers explicitly:
```python
EXPECTED_HEADERS = [
    "date",
    "customer_id",
    "amount",
    "currency",
    "status",
]

if headers != EXPECTED_HEADERS:
    raise ValueError("Schema mismatch detected")
```
Order-sensitive comparison is intentional.
If upstream changes, ingestion should stop immediately.
Silent drift is worse than a crash.
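Put together, the header check might look like this. A minimal sketch: `read_validated_rows` and the file path are illustrative names, not part of any existing script.

```python
import csv

# Expected schema, in order. Order-sensitive on purpose.
EXPECTED_HEADERS = ["date", "customer_id", "amount", "currency", "status"]

def read_validated_rows(path):
    """Read a CSV and fail fast if the header row has drifted."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        headers = next(reader, None)  # None if the file is completely empty
        if headers != EXPECTED_HEADERS:
            raise ValueError(f"Schema mismatch detected: {headers}")
        return list(reader)
```

A renamed, reordered, or missing column stops the run before a single row is written.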
2. They Don’t Sanity-Check Volume
An empty CSV import should not succeed.
A report with 12 rows instead of 1,200 should not quietly pass.
Add simple guardrails:
```python
if len(rows) == 0:
    raise RuntimeError("Empty export detected")

if len(rows) < expected_minimum:
    raise RuntimeError("Suspiciously low record count")
```
Most data corruption is not catastrophic.
It’s gradual.
3. They Allow Partial Inserts
Without transactions, one failed row can leave the database in an inconsistent state.
Use transactional boundaries:
```python
import psycopg2

# conn = psycopg2.connect(...)  # connection setup omitted

with conn:  # commits on success, rolls back on any exception
    with conn.cursor() as cur:
        for row in rows:
            cur.execute(
                "INSERT INTO reports VALUES (%s, %s, %s, %s, %s)",
                row,
            )
```
If one insert fails, the entire operation rolls back.
No half-written states.
4. They Ignore Idempotency
If a job runs twice, does it duplicate records?
If yes, retries become dangerous.
Use unique constraints and upserts:
```sql
INSERT INTO reports (id, amount)
VALUES (%s, %s)
ON CONFLICT (id)
DO UPDATE SET amount = EXCLUDED.amount;
```
Idempotency turns retries from risk into safety.
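The same upsert pattern can be exercised end to end with Python's built-in sqlite3 module (SQLite 3.24+ supports ON CONFLICT), shown here only so the sketch runs without a Postgres server; with psycopg2 the code is identical apart from `%s` placeholders.

```python
import sqlite3

# In-memory database stands in for Postgres for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reports (id TEXT PRIMARY KEY, amount REAL)")

def upsert_report(conn, report_id, amount):
    """Insert or update a row; running it twice leaves exactly one record."""
    conn.execute(
        """
        INSERT INTO reports (id, amount)
        VALUES (?, ?)
        ON CONFLICT (id)
        DO UPDATE SET amount = excluded.amount
        """,
        (report_id, amount),
    )

upsert_report(conn, "r1", 100.0)
upsert_report(conn, "r1", 150.0)  # retry with newer data: no duplicate
```

The retry overwrites rather than duplicates, which is exactly the property a scheduled job needs.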
5. They Depend on Humans
If someone manually uploads a CSV every week, the workflow depends on memory.
Replace the ritual with scheduling:
```
# Every Monday at 02:00
0 2 * * 1 /usr/bin/python3 ingest_reports.py >> ingestion.log 2>&1
```
Humans forget.
Cron does not.
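For the log redirect to be useful, the script itself should fail loudly and return a meaningful exit code. A minimal sketch, where `run_job` and the job callable are illustrative names:

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

def run_job(job):
    """Run the pipeline and return an exit code cron/monitoring can see."""
    try:
        job()
        logging.info("Ingestion completed")
        return 0
    except Exception:
        logging.exception("Ingestion failed")  # full traceback in the log
        return 1
```

In `ingest_reports.py`, ending with `sys.exit(run_job(ingest))` makes the cron entry's exit status reflect reality.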
The Underlying Principle
Automation is not about execution.
It’s about deterministic state transitions.
A safe ingestion pipeline guarantees:
- Structural integrity
- Volume sanity
- Atomic writes
- Safe retries
Everything else is optimism.
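The four guarantees compose into one guarded entry point. A sketch under assumptions: the table layout, `expected_minimum`, and the conflict key are illustrative, and sqlite3 stands in for Postgres so it runs anywhere.

```python
import csv
import sqlite3

EXPECTED_HEADERS = ["date", "customer_id", "amount", "currency", "status"]

def ingest(conn, path, expected_minimum=1):
    with open(path, newline="") as f:
        reader = csv.reader(f)
        headers = next(reader, None)
        if headers != EXPECTED_HEADERS:      # structural integrity
            raise ValueError("Schema mismatch detected")
        rows = list(reader)
    if len(rows) < expected_minimum:         # volume sanity (covers empty files)
        raise RuntimeError("Suspiciously low record count")
    with conn:                               # atomic writes: all rows or none
        conn.executemany(
            # Upsert keyed on (date, customer_id): safe retries
            """
            INSERT INTO reports (date, customer_id, amount, currency, status)
            VALUES (?, ?, ?, ?, ?)
            ON CONFLICT (date, customer_id)
            DO UPDATE SET amount = excluded.amount,
                          currency = excluded.currency,
                          status = excluded.status
            """,
            rows,
        )
```

Running `ingest` twice on the same file leaves the table unchanged, which is the whole point.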
Further Reading
I wrote a deeper breakdown of deterministic ingestion architecture — including file archival, observability, and production safeguards — here: