Most CSV ingestion scripts are written in 30 minutes.
Most ingestion failures take 3 months to notice.
The problem isn’t CSV.
The problem is missing guarantees.
In small teams, CSV ingestion often looks like this:
- Read file
- Loop rows
- Insert into database
- Print “Done”
It works.
Until the export format changes.
Until the file is empty.
Until duplicates accumulate.
Until a partial insert corrupts reporting.
Here’s what most ingestion scripts get wrong.
1. They Don’t Validate Structure Explicitly
Many scripts assume the column order never changes.
That assumption eventually breaks.
Instead of trusting positional mapping, validate headers explicitly:
```python
EXPECTED_HEADERS = [
    "date",
    "customer_id",
    "amount",
    "currency",
    "status",
]

if headers != EXPECTED_HEADERS:
    raise ValueError("Schema mismatch detected")
```
Order-sensitive comparison is intentional.
If upstream changes, ingestion should stop immediately.
Silent drift is worse than a crash.
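Put together, the header check might look like this. A minimal sketch: `read_validated_rows` and the file path are illustrative names, not part of any existing script.

```python
import csv

# Expected schema, in order. Order-sensitive on purpose.
EXPECTED_HEADERS = ["date", "customer_id", "amount", "currency", "status"]

def read_validated_rows(path):
    """Read a CSV and fail fast if the header row has drifted."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        headers = next(reader, None)  # None if the file is completely empty
        if headers != EXPECTED_HEADERS:
            raise ValueError(f"Schema mismatch detected: {headers}")
        return list(reader)
```

A renamed, reordered, or missing column stops the run before a single row is written.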
2. They Don’t Sanity-Check Volume
An empty CSV import should not succeed.
A report with 12 rows instead of 1,200 should not quietly pass.
Add simple guardrails:
```python
if len(rows) == 0:
    raise RuntimeError("Empty export detected")

if len(rows) < expected_minimum:
    raise RuntimeError("Suspiciously low record count")
```
Most data corruption is not catastrophic.
It’s gradual.
3. They Allow Partial Inserts
Without transactions, one failed row can leave the database in an inconsistent state.
Use transactional boundaries:
```python
import psycopg2

# conn = psycopg2.connect(...)  # connection setup omitted

with conn:  # commits on success, rolls back on any exception
    with conn.cursor() as cur:
        for row in rows:
            cur.execute(
                "INSERT INTO reports VALUES (%s, %s, %s, %s, %s)",
                row,
            )
```
If one insert fails, the entire operation rolls back.
No half-written states.
4. They Ignore Idempotency
If a job runs twice, does it duplicate records?
If yes, retries become dangerous.
Use unique constraints and upserts:
```sql
INSERT INTO reports (id, amount)
VALUES (%s, %s)
ON CONFLICT (id)
DO UPDATE SET amount = EXCLUDED.amount;
```
Idempotency turns retries from risk into safety.
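The same upsert pattern can be exercised end to end with Python's built-in sqlite3 module (SQLite 3.24+ supports ON CONFLICT), shown here only so the sketch runs without a Postgres server; with psycopg2 the code is identical apart from `%s` placeholders.

```python
import sqlite3

# In-memory database stands in for Postgres for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reports (id TEXT PRIMARY KEY, amount REAL)")

def upsert_report(conn, report_id, amount):
    """Insert or update a row; running it twice leaves exactly one record."""
    conn.execute(
        """
        INSERT INTO reports (id, amount)
        VALUES (?, ?)
        ON CONFLICT (id)
        DO UPDATE SET amount = excluded.amount
        """,
        (report_id, amount),
    )

upsert_report(conn, "r1", 100.0)
upsert_report(conn, "r1", 150.0)  # retry with newer data: no duplicate
```

The retry overwrites rather than duplicates, which is exactly the property a scheduled job needs.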
5. They Depend on Humans
If someone manually uploads a CSV every week, the workflow depends on memory.
Replace the ritual with scheduling:
```
# Every Monday at 02:00
0 2 * * 1 /usr/bin/python3 ingest_reports.py >> ingestion.log 2>&1
```
Humans forget.
Cron does not.
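For the log redirect to be useful, the script itself should fail loudly and return a meaningful exit code. A minimal sketch, where `run_job` and the job callable are illustrative names:

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

def run_job(job):
    """Run the pipeline and return an exit code cron/monitoring can see."""
    try:
        job()
        logging.info("Ingestion completed")
        return 0
    except Exception:
        logging.exception("Ingestion failed")  # full traceback in the log
        return 1
```

In `ingest_reports.py`, ending with `sys.exit(run_job(ingest))` makes the cron entry's exit status reflect reality.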
The Underlying Principle
Automation is not about execution.
It’s about deterministic state transitions.
A safe ingestion pipeline guarantees:
- Structural integrity
- Volume sanity
- Atomic writes
- Safe retries
Everything else is optimism.
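The four guarantees compose into one guarded entry point. A sketch under assumptions: the table layout, `expected_minimum`, and the conflict key are illustrative, and sqlite3 stands in for Postgres so it runs anywhere.

```python
import csv
import sqlite3

EXPECTED_HEADERS = ["date", "customer_id", "amount", "currency", "status"]

def ingest(conn, path, expected_minimum=1):
    with open(path, newline="") as f:
        reader = csv.reader(f)
        headers = next(reader, None)
        if headers != EXPECTED_HEADERS:      # structural integrity
            raise ValueError("Schema mismatch detected")
        rows = list(reader)
    if len(rows) < expected_minimum:         # volume sanity (covers empty files)
        raise RuntimeError("Suspiciously low record count")
    with conn:                               # atomic writes: all rows or none
        conn.executemany(
            # Upsert keyed on (date, customer_id): safe retries
            """
            INSERT INTO reports (date, customer_id, amount, currency, status)
            VALUES (?, ?, ?, ?, ?)
            ON CONFLICT (date, customer_id)
            DO UPDATE SET amount = excluded.amount,
                          currency = excluded.currency,
                          status = excluded.status
            """,
            rows,
        )
```

Running `ingest` twice on the same file leaves the table unchanged, which is the whole point.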
Further Reading
I wrote a deeper breakdown of deterministic ingestion architecture — including file archival, observability, and production safeguards — here: