Every data project starts the same way: you get a CSV, and it's a mess. Column names with stray spaces, dates in five different formats, NULL, N/A, and none all meaning the same thing, and duplicate rows everywhere.
I built a zero-dependency Python tool that handles all of this automatically.
One Command
python csv_cleaner.py messy_data.csv -o clean_data.csv
No pandas, no pip install, no configuration.
What It Does
Auto-detects encoding
Tries UTF-8, Latin-1, Shift-JIS, and other common encodings automatically.
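The idea is a simple fallback chain: decode the raw bytes with each candidate encoding until one succeeds. A minimal sketch (the function name and candidate order here are illustrative, not the tool's actual code):

```python
def detect_encoding(path, candidates=("utf-8", "shift_jis", "latin-1")):
    """Return the first candidate encoding that decodes the file cleanly."""
    with open(path, "rb") as f:
        raw = f.read()
    for enc in candidates:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte, so it doubles as a last-resort fallback
    return "latin-1"
```

Order matters: Latin-1 accepts any byte sequence, so it has to come last or it would mask everything else.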
Standardizes column names
" Email Address " -> "email_address"
"Revenue ($)" -> "revenue"
"Full Name" -> "full_name"
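The transformations above boil down to lowercasing, dropping anything that isn't alphanumeric, and joining the remaining words with underscores. A sketch of that rule (the helper name is hypothetical):

```python
import re

def slugify_column(name):
    """Lowercase, keep alphanumeric runs, join with underscores."""
    words = re.findall(r"[a-z0-9]+", name.strip().lower())
    return "_".join(words)
```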
Normalizes null values
All become empty strings: NULL, N/A, None, na, nan, #N/A, -, --, missing, undefined
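In code, this is a single set-membership check against the lowercased, stripped value. A sketch, assuming a NULL_VARIANTS set built from the list above:

```python
NULL_VARIANTS = {"null", "n/a", "none", "na", "nan",
                 "#n/a", "-", "--", "missing", "undefined"}

def normalize_null(value):
    """Map any recognized null spelling to the empty string."""
    return "" if value.strip().lower() in NULL_VARIANTS else value
```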
Fixes whitespace
" Jane Doe " -> "Jane Doe"
"double  space" -> "double space"
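Both fixes are one regex away: collapse any run of whitespace to a single space, then trim the ends. A minimal sketch:

```python
import re

def normalize_whitespace(value):
    """Collapse internal whitespace runs and strip leading/trailing spaces."""
    return re.sub(r"\s+", " ", value).strip()
```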
Removes duplicates, normalizes dates, detects types and outliers
Auto-detects 9+ date formats. Infers column types (integer, float, date, boolean, string). Uses IQR to flag outliers.
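The IQR rule flags any value outside [Q1 - 1.5·IQR, Q3 + 1.5·IQR]. A dependency-free sketch of that check (the function and its linear-interpolation quantile are illustrative, not the tool's exact implementation):

```python
def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    s = sorted(values)

    def quantile(q):
        # Linear interpolation between the two nearest sorted values
        pos = q * (len(s) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(s) - 1)
        return s[lo] + (s[hi] - s[lo]) * (pos - lo)

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]
```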
The Report
Rows: 1,247 -> 1,198 (49 duplicates removed)
Columns: 12 -> 12
Column Profiles:
customer_name: string, 892 unique
email: string, 1,043 unique -- 3% null
signup_date: date, 365 unique
revenue: float, 1,102 unique -- 4 outliers
status: string, 3 unique
How It Works
The core is ~400 lines of Python with no dependencies:
```python
# Type detection with an 80% confidence threshold
def guess_type(values):
    non_null = [v for v in values if v.strip().lower() not in NULL_VARIANTS]
    sample = non_null[:200]
    int_count = float_count = 0
    for v in sample:
        v_clean = v.strip().replace(",", "").replace("$", "")
        try:
            int(v_clean)
            int_count += 1
            continue
        except ValueError:
            pass
        try:
            float(v_clean)
            float_count += 1
        except ValueError:
            pass
        # similar checks for date and bool
    threshold = len(sample) * 0.8
    if sample and int_count >= threshold:
        return "integer"
    if sample and (int_count + float_count) >= threshold:
        return "float"
    return "string"
```
Get It
GitHub (free): vesper-astrena/csv-cleaner
git clone https://github.com/vesper-astrena/csv-cleaner
python csv_cleaner.py your_file.csv
The Pro version ($19) adds smart fill for missing values, fuzzy deduplication, outlier handling, custom validation rules, batch processing, and HTML reports.
Zero dependencies. Zero configuration. Just clean data.