The Problem with Data Cleaning
If you've ever worked with real-world data, you know this pain:
import pandas as pd
# Load messy data
df = pd.read_csv("messy_data.csv")
# Manually clean everything
df = df.drop_duplicates() # Remove duplicates
df["age"] = df["age"].fillna(df["age"].mean()) # Fill missing ages
df["age"] = pd.to_numeric(df["age"], errors="coerce") # Convert to numeric
df["email"] = df["email"].str.strip().str.lower() # Clean emails
# ... 50 more lines of cleaning code
# Did I handle outliers? What about type conversion errors?
# What columns did I actually change? No idea.
This approach has serious problems:
- Repetitive boilerplate: Every project starts with the same cleaning code
- No audit trail: What changed? Which rows were affected? Unknown.
- Fragile: One bad value crashes the entire pipeline
- Hard to maintain: Cleaning logic scattered across multiple files
- Not reusable: Can't save and apply the same rules to new data
After writing this code for the hundredth time at work, I decided enough was enough. I built autoclean-dataframe to solve it.
What I Wanted
A library that would:
- ✅ Let me define cleaning rules declaratively (config files)
- ✅ Handle all common operations (missing values, type conversion, outliers, duplicates)
- ✅ Give me a detailed report of what changed
- ✅ Be type-safe and fail gracefully
- ✅ Never modify the original DataFrame
Quick Start: The auto_clean() Function
For common scenarios, there's a one-liner:
from autoclean_dataframe import auto_clean
# Apply smart defaults
df_clean, report = auto_clean(df)
print(report)
This automatically:
- Removes duplicate rows
- Infers and converts types
- Detects outliers
- Handles empty rows/columns
And you get a detailed report of everything that changed.
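To make that concrete, here is a minimal sketch on a deliberately messy frame. The column names and values are invented for illustration, and the exact report contents depend on the library's defaults:

import pandas as pd
from autoclean_dataframe import auto_clean

# A tiny, deliberately messy frame: a duplicate row, a missing age,
# a numeric column stored as strings, and untidy email whitespace/casing
df = pd.DataFrame({
    "age": ["29", "35", None, "35"],
    "email": [" A@Example.com ", "b@example.com", "b@example.com", "b@example.com"],
})

df_clean, report = auto_clean(df)
print(report)           # summary of duplicates removed, types inferred, etc.
print(df_clean.dtypes)  # column types after inference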
Declarative Configuration
For more control, define cleaning rules explicitly:
from autoclean_dataframe import (
    clean_dataframe,
    DataCleanConfig,
    GeneralCleanConfig,
    ColumnConfig,
    TypeConversionConfig,
    MissingValueConfig,
)
config = DataCleanConfig(
    general=GeneralCleanConfig(
        remove_duplicates=True,
        drop_fully_empty_rows=True,
    ),
    columns={
        "age": ColumnConfig(
            column_name="age",
            type_conversion=TypeConversionConfig(target_type="int"),
            missing_values=MissingValueConfig(strategy="mean"),
        ),
        "email": ColumnConfig(
            column_name="email",
            strip_whitespace=True,
            to_lowercase=True,
        ),
    },
)
df_clean, report = clean_dataframe(df, config)
Why declarative?
- Cleaning logic is separate from execution
- Easy to review and modify
- Can be version-controlled
- Reusable across datasets
Use YAML/JSON Configuration Files
Save configurations to files for easy reuse:
config.yaml:
general:
  remove_duplicates: true
  drop_fully_empty_rows: true
columns:
  age:
    column_name: age
    type_conversion:
      target_type: int
    missing_values:
      strategy: mean
  email:
    column_name: email
    strip_whitespace: true
    to_lowercase: true
In Python:
from autoclean_dataframe import load_config, clean_dataframe
config = load_config("config.yaml")
df_clean, report = clean_dataframe(df, config)
Now your cleaning logic is portable and shareable.
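Because the rules live in a file, the same configuration can be applied to any number of datasets. A small sketch (the file names are hypothetical):

import pandas as pd
from autoclean_dataframe import load_config, clean_dataframe

config = load_config("config.yaml")

# Apply the same rules to several exports (hypothetical file names)
for path in ["jan_export.csv", "feb_export.csv"]:
    df = pd.read_csv(path)
    df_clean, report = clean_dataframe(df, config)
    df_clean.to_csv(path.replace(".csv", "_clean.csv"), index=False)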
Comprehensive Cleaning Operations
Missing Value Imputation
MissingValueConfig(
    strategy="mean",    # mean, median, mode, constant, forward_fill, backward_fill, drop_row
    constant_value=0,   # used when strategy="constant"
    threshold=0.5,      # drop the column if more than 50% of values are missing
)
Type Conversion
TypeConversionConfig(
    target_type="int",           # int, float, str, bool, datetime, category
    datetime_format="%Y-%m-%d",
    strict=False,                # coerce conversion errors to NaN instead of failing
    infer_type=False,            # set True to auto-detect the type instead of using target_type
)
Outlier Detection
OutlierConfig(
    method="iqr",            # IQR or Z-score
    action="clip",           # flag, remove, or clip
    iqr_multiplier=1.5,
    zscore_threshold=3.0,
)
PII Masking
PiiConfig(
    pii_type="email",                       # email, phone, ssn, credit_card, custom
    custom_pattern=r"\d{3}-\d{2}-\d{4}",    # used when pii_type="custom"
    mask_char="*",
)
Example output:
- Email: john@example.com → ***@***.com
- Phone: 555-123-4567 → ***-***-4567
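For pii_type="custom", the custom_pattern is a regular expression; conceptually the masking is just a regex substitution. A standalone sketch of the idea (not the library's internal code):

import re

pattern = re.compile(r"\d{3}-\d{2}-\d{4}")   # SSN-like values
text = "SSN: 123-45-6789"

# Keep the last four digits, mask the rest
masked = pattern.sub(lambda m: "***-**-" + m.group()[-4:], text)
print(masked)   # SSN: ***-**-6789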
Text Normalization
ColumnConfig(
    column_name="name",
    strip_whitespace=True,
    to_lowercase=True,
)
Value Validation
ColumnConfig(
    column_name="status",
    allowed_values=["active", "inactive", "pending"],
)
Detailed Reporting
Every cleaning operation returns a comprehensive report:
df_clean, report = clean_dataframe(df, config)
# Human-readable summary
print(report)
# Export to JSON
report.to_json()
# Save to file
from autoclean_dataframe import save_report
save_report(report, "report.json")
save_report(report, "report.txt")
Example report:
======================================================================
DATA CLEANING REPORT
======================================================================
Timestamp: 2024-01-15T10:30:45.123456
OVERVIEW
----------------------------------------------------------------------
Rows before: 100
Rows after: 95
Rows removed: 5
Columns before: 10
Columns after: 10
Duplicate rows removed: 2
COLUMN CHANGES
----------------------------------------------------------------------
age:
  - Missing values handled: 3
  - Type conversions: 97
  - Outliers detected: 2
  - Outliers clipped: 2
email:
  - Whitespace stripped: 5
  - PII values masked: 100
======================================================================
Why this matters:
- Full audit trail for compliance
- Debug issues quickly
- Understand your data better
- Document your cleaning process
Design Principles
The library follows strict principles:
1. Immutable by Default
original_df = df.copy()            # keep a copy to show nothing is mutated
df_clean, report = clean_dataframe(df, config)
# Original df is unchanged
assert df.equals(original_df)      # True
Never modifies input data. Always returns a new DataFrame.
2. Fail-Safe
Type conversion failures don't crash your pipeline:
TypeConversionConfig(
    target_type="int",
    strict=False,   # coerce errors to NaN
)
All failures are tracked in the report.
3. Type-Safe
Full type hints with Pydantic validation:
config = DataCleanConfig(...) # IDE autocomplete works
# Invalid config raises clear error at runtime
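Since the configs are Pydantic models, a mistyped field value should fail fast. A hedged sketch (the exact error message, and which values are accepted, depend on the library's model definitions):

from pydantic import ValidationError
from autoclean_dataframe import TypeConversionConfig

try:
    TypeConversionConfig(target_type="integer")   # likely not an accepted value
except ValidationError as exc:
    print(exc)   # Pydantic points at the offending field
else:
    print("accepted - check the library's list of allowed values")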
4. Traceable
Every change is logged and reported. Nothing happens silently.
Real-World Example: Customer Data Pipeline
import pandas as pd
from autoclean_dataframe import load_config, clean_dataframe, save_report
# Load messy customer data
df = pd.read_csv("customer_data.csv")
# Load cleaning config
config = load_config("customer_clean_config.yaml")
# Clean and report
df_clean, report = clean_dataframe(df, config)
# Save results
df_clean.to_csv("customer_data_clean.csv", index=False)
save_report(report, "cleaning_report.json")
# Check what happened
print(f"Rows removed: {report.rows_removed}")
print(f"Duplicates removed: {report.duplicate_rows_removed}")
customer_clean_config.yaml:
general:
  remove_duplicates: true
  drop_fully_empty_rows: true
columns:
  age:
    type_conversion:
      target_type: int
    missing_values:
      strategy: median
    outliers:
      method: iqr
      action: clip
  email:
    strip_whitespace: true
    to_lowercase: true
    pii:
      pii_type: email
  phone:
    pii:
      pii_type: phone
  status:
    allowed_values: ["active", "inactive", "pending"]
Now you have:
- Reproducible cleaning process
- Full audit trail
- Reusable configuration
- Type-safe operations
Use Cases
1. ML Pipeline Preprocessing
# Training data: quick defaults
df_train_clean, _ = auto_clean(df_train)
# Save the explicit config (a DataCleanConfig built as shown earlier)
from autoclean_dataframe import save_config
save_config(config, "ml_clean_config.yaml")
# Apply the same rules to test data later
config = load_config("ml_clean_config.yaml")
df_test_clean, _ = clean_dataframe(df_test, config)
2. Data Quality Checks
df_clean, report = auto_clean(df, verbose=True)
if report.rows_removed > len(df) * 0.1:
raise ValueError("Too many rows removed - check data quality")
3. Automated ETL
import pandas as pd
from datetime import date
from autoclean_dataframe import load_config, clean_dataframe, save_report

# Read raw data
df = pd.read_csv("daily_export.csv")
# Apply standard cleaning
config = load_config("standard_clean.yaml")
df_clean, report = clean_dataframe(df, config)
# Save for analytics
df_clean.to_parquet("clean_data.parquet")
today = date.today().isoformat()
save_report(report, f"report_{today}.json")
Installation
pip install autoclean-dataframe
Requirements: Python 3.10+, pandas, pydantic
What I Learned
Building this library taught me:
- Declarative > Imperative: Configuration files are easier to understand than code
- Reporting is essential: Without an audit trail, cleaning is a black box
- Immutability prevents bugs: Never modifying input data eliminates entire classes of errors
- Type safety saves time: Pydantic catches config errors before runtime
Try It Out
The library is open source and available on PyPI:
- GitHub: https://github.com/yuuichieguchi/autoclean-dataframe
- PyPI: pip install autoclean-dataframe
- Examples: check the examples/ directory in the repo
I'd love feedback on:
- Missing cleaning operations
- API design
- Performance with large datasets
- Integration ideas
If you've ever been frustrated by repetitive data cleaning, give it a try!