Every data project starts the same way: you get a CSV, and it's a mess. Column names with stray spaces, dates in five different formats, NULL, N/A, and none all meaning the same thing, and duplicate rows everywhere.
I built a zero-dependency Python tool that handles all of this automatically.
One Command
python csv_cleaner.py messy_data.csv -o clean_data.csv
No pandas, no pip install, no configuration.
What It Does
Auto-detects encoding
Tries UTF-8, Latin-1, Shift-JIS, and other common encodings automatically.
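The idea is a simple fallback chain: decode the raw bytes with each candidate encoding until one succeeds. A minimal sketch (the function name and candidate order here are illustrative, not the tool's actual code):

```python
def detect_encoding(path, candidates=("utf-8", "shift_jis", "latin-1")):
    """Return the first candidate encoding that decodes the file cleanly."""
    with open(path, "rb") as f:
        raw = f.read()
    for enc in candidates:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte, so it doubles as a last-resort fallback
    return "latin-1"
```

Order matters: Latin-1 accepts any byte sequence, so it has to come last or it would mask everything else.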
Standardizes column names
" Email Address " -> "email_address"
"Revenue ($)" -> "revenue"
"Full Name" -> "full_name"
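The transformations above boil down to lowercasing, dropping anything that isn't alphanumeric, and joining the remaining words with underscores. A sketch of that rule (the helper name is hypothetical):

```python
import re

def slugify_column(name):
    """Lowercase, keep alphanumeric runs, join with underscores."""
    words = re.findall(r"[a-z0-9]+", name.strip().lower())
    return "_".join(words)
```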
Normalizes null values
All become empty strings: NULL, N/A, None, na, nan, #N/A, -, --, missing, undefined
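In code, this is a single set-membership check against the lowercased, stripped value. A sketch, assuming a NULL_VARIANTS set built from the list above:

```python
NULL_VARIANTS = {"null", "n/a", "none", "na", "nan",
                 "#n/a", "-", "--", "missing", "undefined"}

def normalize_null(value):
    """Map any recognized null spelling to the empty string."""
    return "" if value.strip().lower() in NULL_VARIANTS else value
```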
Fixes whitespace
" Jane Doe " -> "Jane Doe"
"double  space" -> "double space"
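Both fixes are one regex away: collapse any run of whitespace to a single space, then trim the ends. A minimal sketch:

```python
import re

def normalize_whitespace(value):
    """Collapse internal whitespace runs and strip leading/trailing spaces."""
    return re.sub(r"\s+", " ", value).strip()
```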
Removes duplicates, normalizes dates, detects types and outliers
Auto-detects 9+ date formats. Infers column types (integer, float, date, boolean, string). Uses IQR to flag outliers.
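The IQR rule flags any value outside [Q1 - 1.5·IQR, Q3 + 1.5·IQR]. A dependency-free sketch of that check (the function and its linear-interpolation quantile are illustrative, not the tool's exact implementation):

```python
def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    s = sorted(values)

    def quantile(q):
        # Linear interpolation between the two nearest sorted values
        pos = q * (len(s) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(s) - 1)
        return s[lo] + (s[hi] - s[lo]) * (pos - lo)

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]
```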
The Report
Rows: 1,247 -> 1,198 (49 duplicates removed)
Columns: 12 -> 12
Column Profiles:
customer_name: string, 892 unique
email: string, 1,043 unique -- 3% null
signup_date: date, 365 unique
revenue: float, 1,102 unique -- 4 outliers
status: string, 3 unique
How It Works
The core is ~400 lines of Python with no dependencies:
```python
# Type detection with an 80% confidence threshold
def guess_type(values):
    non_null = [v for v in values if v.strip().lower() not in NULL_VARIANTS]
    sample = non_null[:200]
    int_count = float_count = 0
    for v in sample:
        v_clean = v.strip().replace(",", "").replace("$", "")
        try:
            int(v_clean)
            int_count += 1
            continue
        except ValueError:
            pass
        try:
            float(v_clean)
            float_count += 1
        except ValueError:
            pass
        # similar checks for date and bool
    threshold = len(sample) * 0.8
    if sample and int_count >= threshold:
        return "integer"
    if sample and (int_count + float_count) >= threshold:
        return "float"
    return "string"
```
Get It
GitHub (free): vesper-astrena/csv-cleaner
git clone https://github.com/vesper-astrena/csv-cleaner
python csv_cleaner.py your_file.csv
The Pro version ($19) adds smart fill for missing values, fuzzy deduplication, outlier handling, custom validation rules, batch processing, and HTML reports.
Zero dependencies. Zero configuration. Just clean data.