Data Cleaning Automation: The Complete Python Guide
It's often said that data scientists spend around 80% of their time cleaning data. Here's how I automated this process with Python.
The Data Cleaning Pipeline
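import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


class DataCleaner:
    """Chainable cleaning steps; a summary of changes is kept in self.report."""

    def __init__(self, df):
        self.df = df.copy()  # work on a copy so the caller's DataFrame is untouched
        self.report = {}

    def remove_duplicates(self):
        before = len(self.df)
        self.df = self.df.drop_duplicates()
        self.report['duplicates_removed'] = before - len(self.df)
        return self

    def fill_missing(self, strategy='auto'):
        # Only the default strategy exists: median for numeric columns,
        # most frequent value for everything else.
        for col in self.df.columns:
            if not self.df[col].isna().any():
                continue  # nothing to fill in this column
            if is_numeric_dtype(self.df[col]) and not is_bool_dtype(self.df[col]):
                fill_value = self.df[col].median()
            else:
                modes = self.df[col].mode()
                if modes.empty:  # column is entirely missing, nothing to fill with
                    continue
                fill_value = modes.iloc[0]
            self.df[col] = self.df[col].fillna(fill_value)
        return self

    def standardize_dates(self, date_cols=None):
        # By default, treat any column with "date" in its name as a date column.
        if date_cols is None:
            date_cols = [c for c in self.df.columns if 'date' in str(c).lower()]
        for col in date_cols:
            # Values that cannot be parsed become NaT instead of raising.
            self.df[col] = pd.to_datetime(self.df[col], errors='coerce')
        return self

    def clean(self):
        # Run every step in order; each step returns self so calls can be chained.
        return (self.remove_duplicates()
                    .fill_missing()
                    .standardize_dates())


# Usage
df = pd.read_csv('messy_data.csv')
cleaner = DataCleaner(df)
clean_df = cleaner.clean().df
print(cleaner.report)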
Key Features
- Automatic type detection and conversion
- Smart missing value handling
- Outlier detection and removal
- Date standardization
- Text cleaning and normalization
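The class above only covers duplicates, missing values, and dates. As a rough sketch of how the other items on this list might look, here are three standalone helpers. The function names (`convert_types`, `remove_outliers_iqr`, `normalize_text`) are hypothetical and not part of the class above, and the 1.5 × IQR rule is just one common choice for flagging outliers, so treat this as a starting point rather than the definitive approach.

```python
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype


def convert_types(df: pd.DataFrame) -> pd.DataFrame:
    """Convert text columns that contain only numbers to a numeric dtype."""
    out = df.copy()
    for col in out.columns:
        if is_object_dtype(out[col]) or is_string_dtype(out[col]):
            converted = pd.to_numeric(out[col], errors="coerce")
            # Convert only if no non-missing value was lost by the coercion.
            if converted.notna().sum() == out[col].notna().sum():
                out[col] = converted
    return out


def remove_outliers_iqr(df: pd.DataFrame, cols, k: float = 1.5) -> pd.DataFrame:
    """Drop rows whose value in any of `cols` lies outside Q1 - k*IQR .. Q3 + k*IQR."""
    keep = pd.Series(True, index=df.index)
    for col in cols:
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        iqr = q3 - q1
        inside = df[col].between(q1 - k * iqr, q3 + k * iqr)
        # Missing values are kept here so a separate fill step can handle them.
        keep &= inside | df[col].isna()
    return df[keep]


def normalize_text(df: pd.DataFrame, cols) -> pd.DataFrame:
    """Trim, lowercase, and collapse repeated whitespace in the given text columns."""
    out = df.copy()
    for col in cols:
        out[col] = (out[col].str.strip()
                            .str.lower()
                            .str.replace(r"\s+", " ", regex=True))
    return out


# Example on a small made-up frame (the data is illustrative only)
demo = pd.DataFrame({
    "price": ["1", "2", "3", "4", "500"],
    "name": ["  Alice ", "BOB", "bob", "Carol", "dave"],
})
demo = convert_types(demo)
demo = remove_outliers_iqr(demo, ["price"])
demo = normalize_text(demo, ["name"])
print(demo)
```

Each helper returns a new DataFrame instead of modifying its input, so they can be wrapped as extra chainable methods on `DataCleaner` if you prefer that style.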
Complete Data Analysis Toolkit
My complete toolkit includes:
- Data cleaning automation
- Visualization generators
- Statistical analysis tools
- Report generation (PDF/Excel)
- API data collection
Follow me for data science tips and Python automation!