Nico Reyes

Posted on

I automated my CSV cleanup with Python (saved 2 hours every week)


Every Monday morning. Two hours cleaning CSV exports.

Same boring work, different data. Fixed headers, removed duplicates, standardized formats. Every week.

The Problem

I export product data from different sources. Amazon scrapers, supplier sheets, manual research. Each CSV formatted completely differently.

Headers inconsistent. "Product Name" vs "product_name" vs "ProductName". Pick one please.

Duplicates everywhere. Same product ID, different rows. Why.

Prices as strings. "$19.99" not 19.99. Excel hates this.

Empty rows scattered throughout. Date formats all over the place. MM/DD/YYYY, YYYY-MM-DD, timestamps, whatever the system felt like exporting that day.
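To make it concrete, here's the kind of export I mean (rows made up for illustration): a duplicate, a price-as-string, an empty row, and three date formats in one column.

```csv
ProductName,Price,Date Added
Widget A,$19.99,01/15/2024
Widget A,$19.99,2024-01-15
,,
Widget B,"$1,299.00",1705276800
```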

What I Tried First

Manual cleanup in Excel. Filter, sort, find duplicates, fix one by one.

Worked for 10 rows. Died at 500.

Tried Google Sheets formulas. Got messy fast. Broke when column order changed. Gave up.

The Solution

Built a Python script that does the cleanup automatically:

import pandas as pd

def clean_csv(input_file, output_file):
    # Read CSV
    df = pd.read_csv(input_file)

    # Normalize column names (lowercase, underscores)
    df.columns = [col.strip().lower().replace(' ', '_') for col in df.columns]

    # Remove rows that are entirely empty
    df = df.dropna(how='all')

    # Remove duplicates (by product_id if the column exists)
    if 'product_id' in df.columns:
        df = df.drop_duplicates(subset=['product_id'], keep='first')

    # Clean price column: strip "$" and "," as literal characters
    # (regex=False so "$" isn't treated as an end-of-string anchor)
    if 'price' in df.columns:
        df['price'] = (df['price'].astype(str)
                       .str.replace('$', '', regex=False)
                       .str.replace(',', '', regex=False))
        df['price'] = pd.to_numeric(df['price'], errors='coerce')

    # Standardize any date-ish column to YYYY-MM-DD
    # (columns are already lowercased above)
    date_columns = [col for col in df.columns if 'date' in col]
    for col in date_columns:
        df[col] = pd.to_datetime(df[col], errors='coerce')
        df[col] = df[col].dt.strftime('%Y-%m-%d')

    # Save cleaned CSV
    df.to_csv(output_file, index=False)
    print(f"Cleaned {len(df)} rows → {output_file}")

# Usage
clean_csv('raw_products.csv', 'clean_products.csv')

Now I drop CSVs into a folder, run the script, get clean data in 10 seconds.
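The "drop CSVs into a folder" part is just a small loop on top. A rough sketch of how I'd wire it up — `clean_folder` and the `raw/`/`clean/` paths are my own names here, not from the original script; it takes any cleaning function with the same `(input_path, output_path)` signature as `clean_csv`:

```python
from pathlib import Path

def clean_folder(in_dir, out_dir, clean_fn):
    # Run a cleaning function over every CSV in a folder.
    # clean_fn takes (input_path, output_path), like clean_csv above.
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for csv_path in sorted(Path(in_dir).glob('*.csv')):
        clean_fn(str(csv_path), str(out / csv_path.name))

# Usage (with clean_csv from above in scope):
# clean_folder('raw/', 'clean/', clean_csv)
```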

Script handles column name variations automatically. Finds and removes duplicates based on product ID. Converts prices to numbers. No more Excel errors screaming at me about text formatting.

Standardizes dates to YYYY-MM-DD. Removes empty garbage rows.

Still gotta manually check edge cases. Weird characters in product names. Null prices that should be 0. But like 90% automated now.
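Those manual checks could themselves be half-automated with a report step. A sketch of what I mean — `flag_edge_cases` is a hypothetical helper, not part of the original script; it only reports, it doesn't fix anything:

```python
import pandas as pd

def flag_edge_cases(df):
    # Report (not fix) the stuff that still needs eyeballing.
    issues = {}
    if 'product_name' in df.columns:
        # Rows whose names contain non-ASCII ("weird") characters
        weird = df['product_name'].astype(str).str.contains(r'[^\x00-\x7F]')
        issues['weird_names'] = df.index[weird].tolist()
    if 'price' in df.columns:
        # Rows whose price came out null after conversion
        issues['null_prices'] = df.index[df['price'].isna()].tolist()
    return issues

# Example
df = pd.DataFrame({'product_name': ['Mug', 'Café Mug'], 'price': [5.0, None]})
print(flag_edge_cases(df))  # {'weird_names': [1], 'null_prices': [1]}
```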

The Annoying Part

Script works great for supplier CSVs. Doesn't handle scraped data as well, though.

When I scrape Amazon or eBay with ParseForge, the data already comes back cleaner, but it sometimes has fields the script doesn't expect, and the column names are different.

Ended up adding a config file to map column names:

# column_mappings.json
{
  "asin": "product_id",
  "item_name": "product_name",
  "current_price": "price"
}
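Applying the mapping is one `df.rename` call. A minimal sketch, assuming the JSON file above (the `apply_column_mappings` name is mine) — `rename` silently ignores keys that aren't in the DataFrame, which is why one mapping file can cover several sources:

```python
import json
import pandas as pd

def apply_column_mappings(df, mapping_file):
    # Rename scraped-data columns to the names clean_csv expects.
    with open(mapping_file) as f:
        mappings = json.load(f)
    # Keys missing from df.columns are ignored, so this is safe
    # to run on every source.
    return df.rename(columns=mappings)

# Example: an Amazon-flavored frame
# df = apply_column_mappings(df, 'column_mappings.json')
```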

Not perfect. Still breaks on weird supplier exports sometimes. But way better than doing it all manually like an idiot.
