DEV Community

Yuuichi Eguchi

I Built a Library to Stop Writing Data Cleaning Code

The Problem with Data Cleaning

If you've ever worked with real-world data, you know this pain:

import pandas as pd

# Load messy data
df = pd.read_csv("messy_data.csv")

# Manually clean everything
df = df.drop_duplicates()  # Remove duplicates
df["age"] = df["age"].fillna(df["age"].mean())  # Fill missing ages
df["age"] = pd.to_numeric(df["age"], errors="coerce")  # Convert to numeric
df["email"] = df["email"].str.strip().str.lower()  # Clean emails
# ... 50 more lines of cleaning code

# Did I handle outliers? What about type conversion errors?
# What columns did I actually change? No idea.

This approach has serious problems:

  1. Repetitive boilerplate: Every project starts with the same cleaning code
  2. No audit trail: What changed? Which rows were affected? Unknown.
  3. Fragile: One bad value crashes the entire pipeline
  4. Hard to maintain: Cleaning logic scattered across multiple files
  5. Not reusable: Can't save and apply the same rules to new data

After writing this code for the hundredth time at work, I decided enough was enough. I built autoclean-dataframe to solve it.


What I Wanted

A library that would:

  • ✅ Let me define cleaning rules declaratively (config files)
  • ✅ Handle all common operations (missing values, type conversion, outliers, duplicates)
  • ✅ Give me a detailed report of what changed
  • ✅ Be type-safe and fail gracefully
  • ✅ Never modify the original DataFrame

Quick Start: The auto_clean() Function

For common scenarios, there's a one-liner:

from autoclean_dataframe import auto_clean

# Apply smart defaults
df_clean, report = auto_clean(df)
print(report)

This automatically:

  • Removes duplicate rows
  • Infers and converts types
  • Detects outliers
  • Handles empty rows/columns

And you get a detailed report of everything that changed.
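
Conceptually, those defaults map onto a handful of plain-pandas operations. A rough sketch (illustrative only, not the library's actual implementation):

```python
import pandas as pd

def naive_auto_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Simplified stand-in for 'smart defaults' cleaning."""
    out = df.copy()                        # never mutate the input
    out = out.drop_duplicates()            # remove duplicate rows
    out = out.dropna(how="all")            # drop fully empty rows
    out = out.dropna(axis=1, how="all")    # drop fully empty columns
    out = out.convert_dtypes()             # infer better (nullable) dtypes
    return out

df = pd.DataFrame({"a": ["1", "2", "2", None], "b": [None, None, None, None]})
clean = naive_auto_clean(df)               # 2 rows and 1 column remain
```

The real library additionally produces the change report alongside the cleaned frame.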


Declarative Configuration

For more control, define cleaning rules explicitly:

from autoclean_dataframe import (
    clean_dataframe,
    DataCleanConfig,
    GeneralCleanConfig,
    ColumnConfig,
    TypeConversionConfig,
    MissingValueConfig,
)

config = DataCleanConfig(
    general=GeneralCleanConfig(
        remove_duplicates=True,
        drop_fully_empty_rows=True,
    ),
    columns={
        "age": ColumnConfig(
            column_name="age",
            type_conversion=TypeConversionConfig(target_type="int"),
            missing_values=MissingValueConfig(strategy="mean"),
        ),
        "email": ColumnConfig(
            column_name="email",
            strip_whitespace=True,
            to_lowercase=True,
        ),
    }
)

df_clean, report = clean_dataframe(df, config)

Why declarative?

  • Cleaning logic is separate from execution
  • Easy to review and modify
  • Can be version-controlled
  • Reusable across datasets

Use YAML/JSON Configuration Files

Save configurations to files for easy reuse:

config.yaml:

general:
  remove_duplicates: true
  drop_fully_empty_rows: true

columns:
  age:
    column_name: age
    type_conversion:
      target_type: int
    missing_values:
      strategy: mean

  email:
    column_name: email
    strip_whitespace: true
    to_lowercase: true

In Python:

from autoclean_dataframe import load_config, clean_dataframe

config = load_config("config.yaml")
df_clean, report = clean_dataframe(df, config)

Now your cleaning logic is portable and shareable.


Comprehensive Cleaning Operations

Missing Value Imputation

MissingValueConfig(
    strategy="mean",  # mean, median, mode, constant, forward_fill, backward_fill, drop_row
    constant_value=0,  # For strategy="constant"
    threshold=0.5,  # Drop column if >50% missing
)
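
As a mental model, the "mean" strategy combined with the missing-ratio threshold can be sketched in plain pandas (the `impute_mean` helper below is illustrative, not part of the library):

```python
import pandas as pd

def impute_mean(df: pd.DataFrame, column: str, threshold: float = 0.5) -> pd.DataFrame:
    """Fill missing values with the column mean, or drop the column if too sparse."""
    out = df.copy()
    if out[column].isna().mean() > threshold:
        return out.drop(columns=[column])    # more than 50% missing: drop the column
    out[column] = out[column].fillna(out[column].mean())
    return out

df = pd.DataFrame({"age": [20.0, None, 40.0, None]})
filled = impute_mean(df, "age")    # missing ages become 30.0, the mean of 20 and 40
```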

Type Conversion

TypeConversionConfig(
    target_type="int",  # int, float, str, bool, datetime, category
    datetime_format="%Y-%m-%d",
    strict=False,  # Coerce errors to NaN instead of failing
    infer_type=False,  # Set True to auto-detect the type
)
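
Non-strict mode corresponds to pandas' coercion behavior. Roughly (a sketch, not the library's internals):

```python
import pandas as pd

def to_int_safe(series: pd.Series, strict: bool = False) -> pd.Series:
    """Convert to integers; in non-strict mode bad values become <NA> instead of raising."""
    converted = pd.to_numeric(series, errors="raise" if strict else "coerce")
    return converted.astype("Int64")    # nullable dtype so <NA> can coexist with ints

s = pd.Series(["1", "2", "oops", "4"])
result = to_int_safe(s)                 # "oops" becomes <NA>, the pipeline keeps running
```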

Outlier Detection

OutlierConfig(
    method="iqr",  # iqr, zscore
    action="clip",  # flag, remove, or clip
    iqr_multiplier=1.5,
    zscore_threshold=3.0,
)
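
The IQR method with the clip action is the classic Tukey-fence calculation. In plain pandas it looks roughly like this (illustrative, not the library's code):

```python
import pandas as pd

def clip_iqr(series: pd.Series, multiplier: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to the fence boundaries."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - multiplier * iqr, upper=q3 + multiplier * iqr)

s = pd.Series([10, 12, 11, 13, 100])    # 100 is an obvious outlier
clipped = clip_iqr(s)                    # 100 is clipped down to the upper fence, 16.0
```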

PII Masking

PiiConfig(
    pii_type="email",  # email, phone, ssn, credit_card, custom
    custom_pattern=r"\d{3}-\d{2}-\d{4}",
    mask_char="*",
)

Example output:

  • Email: john@example.com → ***@***.com
  • Phone: 555-123-4567 → ***-***-4567
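
Masking like this boils down to regex substitution. A minimal sketch (the helpers below are illustrative, not the library's API):

```python
import re

def mask_email(value: str, mask_char: str = "*") -> str:
    """Hide everything in an email except the top-level domain."""
    return re.sub(r"^[^@]+@.+\.([^.]+)$", rf"{mask_char * 3}@{mask_char * 3}.\1", value)

def mask_phone(value: str, mask_char: str = "*") -> str:
    """Keep only the last four digits of an NNN-NNN-NNNN phone number."""
    return re.sub(r"^\d{3}-\d{3}-(\d{4})$", rf"{mask_char * 3}-{mask_char * 3}-\1", value)
```

In a real pipeline you would apply such helpers across a column with `Series.map`.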

Text Normalization

ColumnConfig(
    column_name="name",
    strip_whitespace=True,
    to_lowercase=True,
)

Value Validation

ColumnConfig(
    column_name="status",
    allowed_values=["active", "inactive", "pending"],
)
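
Conceptually this is an `isin` check. One plausible sketch marks disallowed values as missing so they can be counted in a report (illustrative only, not the library's behavior):

```python
import pandas as pd

def enforce_allowed(series: pd.Series, allowed: list[str]) -> pd.Series:
    """Replace values outside the allowed set with NaN so they can be reported."""
    return series.where(series.isin(allowed))

s = pd.Series(["active", "ACTIVE?", "pending", "inactive"])
validated = enforce_allowed(s, ["active", "inactive", "pending"])   # "ACTIVE?" becomes NaN
```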

Detailed Reporting

Every cleaning operation returns a comprehensive report:

df_clean, report = clean_dataframe(df, config)

# Human-readable summary
print(report)

# Export to JSON
report.to_json()

# Save to file
from autoclean_dataframe import save_report
save_report(report, "report.json")
save_report(report, "report.txt")

Example report:

======================================================================
DATA CLEANING REPORT
======================================================================
Timestamp: 2024-01-15T10:30:45.123456

OVERVIEW
----------------------------------------------------------------------
Rows before:           100
Rows after:            95
Rows removed:          5
Columns before:        10
Columns after:         10
Duplicate rows removed: 2

COLUMN CHANGES
----------------------------------------------------------------------
age:
  - Missing values handled: 3
  - Type conversions: 97
  - Outliers detected: 2
  - Outliers clipped: 2

email:
  - Whitespace stripped: 5
  - PII values masked: 100

======================================================================

Why this matters:

  • Full audit trail for compliance
  • Debug issues quickly
  • Understand your data better
  • Document your cleaning process
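
To make the shape concrete, a hypothetical minimal report object might look like the following (field names taken from the output above; the library's actual class will differ):

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class CleaningReport:
    """Toy stand-in for a cleaning report with an audit-friendly JSON export."""
    rows_before: int
    rows_after: int
    duplicate_rows_removed: int
    column_changes: dict = field(default_factory=dict)

    @property
    def rows_removed(self) -> int:
        return self.rows_before - self.rows_after

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)

report = CleaningReport(rows_before=100, rows_after=95, duplicate_rows_removed=2)
```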

Design Principles

The library follows strict principles:

1. Immutable by Default

original = df.copy()
df_clean, report = clean_dataframe(df, config)
# Original df is unchanged
assert df.equals(original)  # True

Never modifies input data. Always returns a new DataFrame.

2. Fail-Safe

Type conversion failures don't crash your pipeline:

TypeConversionConfig(
    target_type="int",
    strict=False,  # Coerce errors to NaN
)

All failures are tracked in the report.

3. Type-Safe

Full type hints with Pydantic validation:

config = DataCleanConfig(...)  # IDE autocomplete works
# An invalid config raises a clear validation error when it is constructed

4. Traceable

Every change is logged and reported. Nothing happens silently.


Real-World Example: Customer Data Pipeline

import pandas as pd
from autoclean_dataframe import load_config, clean_dataframe, save_report

# Load messy customer data
df = pd.read_csv("customer_data.csv")

# Load cleaning config
config = load_config("customer_clean_config.yaml")

# Clean and report
df_clean, report = clean_dataframe(df, config)

# Save results
df_clean.to_csv("customer_data_clean.csv", index=False)
save_report(report, "cleaning_report.json")

# Check what happened
print(f"Rows removed: {report.rows_removed}")
print(f"Duplicates removed: {report.duplicate_rows_removed}")

customer_clean_config.yaml:

general:
  remove_duplicates: true
  drop_fully_empty_rows: true

columns:
  age:
    type_conversion:
      target_type: int
    missing_values:
      strategy: median
    outliers:
      method: iqr
      action: clip

  email:
    strip_whitespace: true
    to_lowercase: true
    pii:
      pii_type: email

  phone:
    pii:
      pii_type: phone

  status:
    allowed_values: ["active", "inactive", "pending"]

Now you have:

  • Reproducible cleaning process
  • Full audit trail
  • Reusable configuration
  • Type-safe operations

Use Cases

1. ML Pipeline Preprocessing

from autoclean_dataframe import load_config, clean_dataframe

# Clean training data with explicit rules
config = load_config("ml_clean_config.yaml")
df_train_clean, _ = clean_dataframe(df_train, config)

# The config file is version-controlled, so the exact same
# rules can be applied to test data later
df_test_clean, _ = clean_dataframe(df_test, config)

2. Data Quality Checks

df_clean, report = auto_clean(df, verbose=True)

if report.rows_removed > len(df) * 0.1:
    raise ValueError("Too many rows removed - check data quality")

3. Automated ETL

from datetime import date

# Read raw data
df = pd.read_csv("daily_export.csv")

# Apply standard cleaning
config = load_config("standard_clean.yaml")
df_clean, report = clean_dataframe(df, config)

# Save for analytics
df_clean.to_parquet("clean_data.parquet")
save_report(report, f"report_{date.today().isoformat()}.json")

Installation

pip install autoclean-dataframe

Requirements: Python 3.10+, pandas, pydantic


What I Learned

Building this library taught me:

  • Declarative > Imperative: Configuration files are easier to understand than code
  • Reporting is essential: Without an audit trail, cleaning is a black box
  • Immutability prevents bugs: Never modifying input data eliminates entire classes of errors
  • Type safety saves time: Pydantic catches config errors before runtime

Try It Out

The library is open source and available on PyPI as autoclean-dataframe.

I'd love feedback on:

  • Missing cleaning operations
  • API design
  • Performance with large datasets
  • Integration ideas

If you've ever been frustrated by repetitive data cleaning, give it a try!

