The Problem with Data Cleaning
If you've ever worked with real-world data, you know this pain:
import pandas as pd
# Load messy data
df = pd.read_csv("messy_data.csv")
# Manually clean everything
df = df.drop_duplicates() # Remove duplicates
df["age"] = df["age"].fillna(df["age"].mean()) # Fill missing ages
df["age"] = pd.to_numeric(df["age"], errors="coerce") # Convert to numeric
df["email"] = df["email"].str.strip().str.lower() # Clean emails
# ... 50 more lines of cleaning code
# Did I handle outliers? What about type conversion errors?
# What columns did I actually change? No idea.
This approach has serious problems:
- Repetitive boilerplate: Every project starts with the same cleaning code
- No audit trail: What changed? Which rows were affected? Unknown.
- Fragile: One bad value crashes the entire pipeline
- Hard to maintain: Cleaning logic scattered across multiple files
- Not reusable: Can't save and apply the same rules to new data
After writing this code for the hundredth time at work, I decided enough was enough. I built autoclean-dataframe to solve it.
What I Wanted
A library that would:
- ✅ Let me define cleaning rules declaratively (config files)
- ✅ Handle all common operations (missing values, type conversion, outliers, duplicates)
- ✅ Give me a detailed report of what changed
- ✅ Be type-safe and fail gracefully
- ✅ Never modify the original DataFrame
Quick Start: The auto_clean() Function
For common scenarios, there's a one-liner:
from autoclean_dataframe import auto_clean
# Apply smart defaults
df_clean, report = auto_clean(df)
print(report)
This automatically:
- Removes duplicate rows
- Infers and converts types
- Detects outliers
- Handles empty rows/columns
And you get a detailed report of everything that changed.
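To make that concrete, here is a minimal sketch on a deliberately messy frame. The column names and values are invented for illustration, and the exact report contents depend on the library's defaults:

import pandas as pd
from autoclean_dataframe import auto_clean

# A tiny, deliberately messy frame: a duplicate row, a missing age,
# a numeric column stored as strings, and untidy email whitespace/casing
df = pd.DataFrame({
    "age": ["29", "35", None, "35"],
    "email": [" A@Example.com ", "b@example.com", "b@example.com", "b@example.com"],
})

df_clean, report = auto_clean(df)
print(report)           # summary of duplicates removed, types inferred, etc.
print(df_clean.dtypes)  # column types after inference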
Declarative Configuration
For more control, define cleaning rules explicitly:
from autoclean_dataframe import (
    clean_dataframe,
    DataCleanConfig,
    GeneralCleanConfig,
    ColumnConfig,
    TypeConversionConfig,
    MissingValueConfig,
)
config = DataCleanConfig(
    general=GeneralCleanConfig(
        remove_duplicates=True,
        drop_fully_empty_rows=True,
    ),
    columns={
        "age": ColumnConfig(
            column_name="age",
            type_conversion=TypeConversionConfig(target_type="int"),
            missing_values=MissingValueConfig(strategy="mean"),
        ),
        "email": ColumnConfig(
            column_name="email",
            strip_whitespace=True,
            to_lowercase=True,
        ),
    },
)
df_clean, report = clean_dataframe(df, config)
Why declarative?
- Cleaning logic is separate from execution
- Easy to review and modify
- Can be version-controlled
- Reusable across datasets
Use YAML/JSON Configuration Files
Save configurations to files for easy reuse:
config.yaml:
general:
  remove_duplicates: true
  drop_fully_empty_rows: true
columns:
  age:
    column_name: age
    type_conversion:
      target_type: int
    missing_values:
      strategy: mean
  email:
    column_name: email
    strip_whitespace: true
    to_lowercase: true
In Python:
from autoclean_dataframe import load_config, clean_dataframe
config = load_config("config.yaml")
df_clean, report = clean_dataframe(df, config)
Now your cleaning logic is portable and shareable.
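Because the rules live in a file, the same configuration can be applied to any number of datasets. A small sketch (the file names are hypothetical):

import pandas as pd
from autoclean_dataframe import load_config, clean_dataframe

config = load_config("config.yaml")

# Apply the same rules to several exports (hypothetical file names)
for path in ["jan_export.csv", "feb_export.csv"]:
    df = pd.read_csv(path)
    df_clean, report = clean_dataframe(df, config)
    df_clean.to_csv(path.replace(".csv", "_clean.csv"), index=False)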
Comprehensive Cleaning Operations
Missing Value Imputation
MissingValueConfig(
    strategy="mean",    # mean, median, mode, constant, forward_fill, backward_fill, drop_row
    constant_value=0,   # used when strategy="constant"
    threshold=0.5,      # drop the column if more than 50% of values are missing
)
Type Conversion
TypeConversionConfig(
    target_type="int",           # int, float, str, bool, datetime, category
    datetime_format="%Y-%m-%d",
    strict=False,                # coerce conversion errors to NaN instead of failing
    infer_type=False,            # set True to auto-detect the type instead of using target_type
)
Outlier Detection
OutlierConfig(
    method="iqr",            # IQR or Z-score
    action="clip",           # flag, remove, or clip
    iqr_multiplier=1.5,
    zscore_threshold=3.0,
)
PII Masking
PiiConfig(
    pii_type="email",                       # email, phone, ssn, credit_card, custom
    custom_pattern=r"\d{3}-\d{2}-\d{4}",    # used when pii_type="custom"
    mask_char="*",
)
Example output:
- Email: john@example.com → ***@***.com
- Phone: 555-123-4567 → ***-***-4567
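For pii_type="custom", the custom_pattern is a regular expression; conceptually the masking is just a regex substitution. A standalone sketch of the idea (not the library's internal code):

import re

pattern = re.compile(r"\d{3}-\d{2}-\d{4}")   # SSN-like values
text = "SSN: 123-45-6789"

# Keep the last four digits, mask the rest
masked = pattern.sub(lambda m: "***-**-" + m.group()[-4:], text)
print(masked)   # SSN: ***-**-6789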
Text Normalization
ColumnConfig(
    column_name="name",
    strip_whitespace=True,
    to_lowercase=True,
)
Value Validation
ColumnConfig(
    column_name="status",
    allowed_values=["active", "inactive", "pending"],
)
Detailed Reporting
Every cleaning operation returns a comprehensive report:
df_clean, report = clean_dataframe(df, config)
# Human-readable summary
print(report)
# Export to JSON
report.to_json()
# Save to file
from autoclean_dataframe import save_report
save_report(report, "report.json")
save_report(report, "report.txt")
Example report:
======================================================================
DATA CLEANING REPORT
======================================================================
Timestamp: 2024-01-15T10:30:45.123456
OVERVIEW
----------------------------------------------------------------------
Rows before: 100
Rows after: 95
Rows removed: 5
Columns before: 10
Columns after: 10
Duplicate rows removed: 2
COLUMN CHANGES
----------------------------------------------------------------------
age:
  - Missing values handled: 3
  - Type conversions: 97
  - Outliers detected: 2
  - Outliers clipped: 2
email:
  - Whitespace stripped: 5
  - PII values masked: 100
======================================================================
Why this matters:
- Full audit trail for compliance
- Debug issues quickly
- Understand your data better
- Document your cleaning process
Design Principles
The library follows strict principles:
1. Immutable by Default
original_df = df.copy()            # keep a copy to show nothing is mutated
df_clean, report = clean_dataframe(df, config)
# Original df is unchanged
assert df.equals(original_df)      # True
Never modifies input data. Always returns a new DataFrame.
2. Fail-Safe
Type conversion failures don't crash your pipeline:
TypeConversionConfig(
    target_type="int",
    strict=False,   # coerce errors to NaN
)
All failures are tracked in the report.
3. Type-Safe
Full type hints with Pydantic validation:
config = DataCleanConfig(...) # IDE autocomplete works
# Invalid config raises clear error at runtime
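Since the configs are Pydantic models, a mistyped field value should fail fast. A hedged sketch (the exact error message, and which values are accepted, depend on the library's model definitions):

from pydantic import ValidationError
from autoclean_dataframe import TypeConversionConfig

try:
    TypeConversionConfig(target_type="integer")   # likely not an accepted value
except ValidationError as exc:
    print(exc)   # Pydantic points at the offending field
else:
    print("accepted - check the library's list of allowed values")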
4. Traceable
Every change is logged and reported. Nothing happens silently.
Real-World Example: Customer Data Pipeline
import pandas as pd
from autoclean_dataframe import load_config, clean_dataframe, save_report
# Load messy customer data
df = pd.read_csv("customer_data.csv")
# Load cleaning config
config = load_config("customer_clean_config.yaml")
# Clean and report
df_clean, report = clean_dataframe(df, config)
# Save results
df_clean.to_csv("customer_data_clean.csv", index=False)
save_report(report, "cleaning_report.json")
# Check what happened
print(f"Rows removed: {report.rows_removed}")
print(f"Duplicates removed: {report.duplicate_rows_removed}")
customer_clean_config.yaml:
general:
  remove_duplicates: true
  drop_fully_empty_rows: true
columns:
  age:
    type_conversion:
      target_type: int
    missing_values:
      strategy: median
    outliers:
      method: iqr
      action: clip
  email:
    strip_whitespace: true
    to_lowercase: true
    pii:
      pii_type: email
  phone:
    pii:
      pii_type: phone
  status:
    allowed_values: ["active", "inactive", "pending"]
Now you have:
- Reproducible cleaning process
- Full audit trail
- Reusable configuration
- Type-safe operations
Use Cases
1. ML Pipeline Preprocessing
# Training data: quick defaults
df_train_clean, _ = auto_clean(df_train)
# Save the explicit config (a DataCleanConfig built as shown earlier)
from autoclean_dataframe import save_config
save_config(config, "ml_clean_config.yaml")
# Apply the same rules to test data later
config = load_config("ml_clean_config.yaml")
df_test_clean, _ = clean_dataframe(df_test, config)
2. Data Quality Checks
df_clean, report = auto_clean(df, verbose=True)
if report.rows_removed > len(df) * 0.1:
raise ValueError("Too many rows removed - check data quality")
3. Automated ETL
import pandas as pd
from datetime import date
from autoclean_dataframe import load_config, clean_dataframe, save_report

# Read raw data
df = pd.read_csv("daily_export.csv")
# Apply standard cleaning
config = load_config("standard_clean.yaml")
df_clean, report = clean_dataframe(df, config)
# Save for analytics
df_clean.to_parquet("clean_data.parquet")
today = date.today().isoformat()
save_report(report, f"report_{today}.json")
Installation
pip install autoclean-dataframe
Requirements: Python 3.10+, pandas, pydantic
What I Learned
Building this library taught me:
- Declarative > Imperative: Configuration files are easier to understand than code
- Reporting is essential: Without an audit trail, cleaning is a black box
- Immutability prevents bugs: Never modifying input data eliminates entire classes of errors
- Type safety saves time: Pydantic catches config errors before runtime
Try It Out
The library is open source and available on PyPI:
- GitHub: https://github.com/yuuichieguchi/autoclean-dataframe
- PyPI: pip install autoclean-dataframe
- Examples: check the examples/ directory in the repo
I'd love feedback on:
- Missing cleaning operations
- API design
- Performance with large datasets
- Integration ideas
If you've ever been frustrated by repetitive data cleaning, give it a try!