Cleaning messy CSVs without pandas: 3 tiny no-install scripts

#dataengineering #productivity #python #tutorial

Messy CSV exports are a tax on every data task: stray whitespace, duplicate rows, inconsistent headers, files too big to open. You don't always need pandas for this — Python's built-in csv module handles most of it with zero dependencies and code you can drop on any machine.

Here are three small patterns I reach for constantly.

1. Clean: dedupe, trim, fix headers

import csv

def clean(path, out):
    seen = set()
    with open(path, newline='', encoding='utf-8-sig') as f:
        rows = list(csv.reader(f))
    header = [h.strip().lower().replace(' ', '_') for h in rows[0]]
    cleaned = []
    for raw in rows[1:]:
        cells = [c.strip() for c in raw]
        if all(c == '' for c in cells):
            continue          # drop empty rows
        key = tuple(cells)
        if key in seen:
            continue          # drop duplicates
        seen.add(key)
        cleaned.append(cells)
    with open(out, 'w', newline='', encoding='utf-8') as f:
        w = csv.writer(f)
        w.writerow(header)
        w.writerows(cleaned)

Trims every cell, normalizes headers (First Name -> first_name), and removes empty/duplicate rows.

2. Split a huge CSV into chunks

import csv

def split(path, rows_per_file):
    with open(path, newline='', encoding='utf-8-sig') as f:
        reader = csv.reader(f)
        header = next(reader)
        chunk, part = [], 1
        for row in reader:
            chunk.append(row)
            if len(chunk) >= rows_per_file:
                _write(f'part{part}.csv', header, chunk); part += 1; chunk = []
        if chunk:
            _write(f'part{part}.csv', header, chunk)

def _write(name, header, rows):
    with open(name, 'w', newline='', encoding='utf-8') as f:
        w = csv.writer(f); w.writerow(header); w.writerows(rows)

3. Merge many CSVs into one

import csv, glob

def merge(pattern, out):
    header = None
    with open(out, 'w', newline='', encoding='utf-8') as o:
        w = csv.writer(o)
        for path in glob.glob(pattern):
            with open(path, newline='', encoding='utf-8-sig') as f:
                r = csv.reader(f)
                h = next(r)
                if header is None:
                    header = h; w.writerow(header)
                w.writerows(r)

Why no pandas?

For one-off cleanups and small tools, the stdlib csv module is faster to ship: no install, runs anywhere Python 3.8+ runs, and the code stays readable enough to tweak for your own rules.

I cleaned these up into a tiny toolkit (proper CLI flags, edge cases, comments) so I stop rewriting them. If you'd rather grab them ready-made, it's $10 with full source: https://ko-fi.com/s/bfedf3fb78

What's your go-to for quick CSV wrangling?