
Thesius Code

Posted on • Originally published at datanest-stores.pages.dev

Data Cleaning Playbook

Production-tested frameworks for data quality: deduplication, standardization, outlier detection, missing value strategies, and validation rules. Every technique includes Python implementation, SQL alternatives, and decision criteria for when to use each approach.

Key Features

  • Deduplication Engine — fuzzy matching, exact matching, and composite key strategies
  • Standardization Recipes — names, addresses, dates, phone numbers, currencies
  • Outlier Detection — IQR, Z-score, isolation forest, and domain-specific rules
  • Missing Value Strategies — imputation decision tree with 8 techniques and when to use each
  • Validation Rule Library — 40+ reusable validation patterns for common data types
  • Data Profiling Scripts — automated data quality reports with completeness, uniqueness, and distribution stats
  • Before/After Audit Trail — track every transformation for reproducibility
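The full imputation decision tree ships with the kit; as a taste, here is a minimal pandas sketch (column names and data are illustrative, not the kit's API) of two of the most common missing-value techniques: median imputation for numeric columns and mode imputation for categorical ones.

```python
import pandas as pd

# Illustrative data with gaps in a numeric and a categorical column
df = pd.DataFrame({
    "age": [34, None, 29, 41, None],
    "segment": ["retail", "retail", None, "wholesale", "retail"],
})

# Numeric column: median imputation is robust to skew and outliers
df["age"] = df["age"].fillna(df["age"].median())

# Categorical column: fill with the mode (most frequent value)
df["segment"] = df["segment"].fillna(df["segment"].mode()[0])
```

Which technique to pick depends on the downstream use; the playbook's decision tree covers when a simple fill like this is safe and when you need model-based or multiple imputation.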

Quick Start

from src.profiler import profile_dataframe
from src.dedup import deduplicate
from src.cleaning import clean_pipeline

import pandas as pd

df = pd.read_csv("customers.csv")  # any DataFrame works; filename is illustrative

# 1. Profile your data to identify issues
report = profile_dataframe(df)
print(report.summary())
# Output: 12 columns, 45,231 rows, 3.2% missing, 847 duplicates

# 2. Deduplicate
df_deduped = deduplicate(
    df,
    match_columns=["email", "last_name"],
    strategy="fuzzy",         # "exact" | "fuzzy" | "composite"
    similarity_threshold=0.85,
)

# 3. Run the full cleaning pipeline
df_clean = clean_pipeline(df_deduped, config="config.example.yaml")

Usage Examples

Fuzzy Deduplication in Python

from difflib import SequenceMatcher

def fuzzy_match_score(s1: str, s2: str) -> float:
    """Calculate similarity ratio between two strings."""
    return SequenceMatcher(None, s1.lower().strip(), s2.lower().strip()).ratio()

def find_duplicates(records: list[dict], key: str,
                    threshold: float = 0.85) -> list[tuple[int, int, float]]:
    """Find fuzzy duplicate pairs above similarity threshold."""
    matches = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            score = fuzzy_match_score(records[i][key], records[j][key])
            if score >= threshold:
                matches.append((i, j, score))
    return matches

# Example: find near-duplicate company names
companies = [
    {"name": "Acme Corporation"},
    {"name": "Acme Corp."},
    {"name": "ACME Corp"},
    {"name": "Beta Industries"},
]
dupes = find_duplicates(companies, "name", threshold=0.80)
# [(1, 2, 0.95)]: only "Acme Corp." vs "ACME Corp" clears 0.80;
# drop the threshold to ~0.65 to also pair them with "Acme Corporation"
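The pairwise loop above is O(n²) and slows down past a few thousand records. The standard remedy is blocking: partition records by a cheap key and only compare within each block. A minimal sketch, reusing the scorer above (first-letter blocking is an illustrative choice; production blocking keys are usually richer, e.g. zip code or a phonetic code):

```python
from collections import defaultdict
from difflib import SequenceMatcher

def fuzzy_match_score(s1: str, s2: str) -> float:
    return SequenceMatcher(None, s1.lower().strip(), s2.lower().strip()).ratio()

def find_duplicates_blocked(records: list[dict], key: str,
                            threshold: float = 0.85) -> list[tuple[int, int, float]]:
    """Compare only within blocks instead of all n*(n-1)/2 pairs."""
    blocks: dict[str, list[int]] = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[rec[key].strip().lower()[:1]].append(idx)  # block key: first letter
    matches = []
    for indices in blocks.values():
        for a in range(len(indices)):
            for b in range(a + 1, len(indices)):
                i, j = indices[a], indices[b]
                score = fuzzy_match_score(records[i][key], records[j][key])
                if score >= threshold:
                    matches.append((i, j, score))
    return matches
```

Records in different blocks are never compared, so a bad blocking key can hide true duplicates; the usual trade-off is to run several blocking passes with different keys and union the results.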

SQL: Identifying Duplicates with Window Functions

-- Find duplicate records by email, keep the most recent
WITH ranked AS (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY LOWER(TRIM(email))
            ORDER BY updated_at DESC
        ) AS rn
    FROM customers
)
SELECT * FROM ranked WHERE rn = 1;  -- Keep winners

-- Audit: count duplicates by column
SELECT
    LOWER(TRIM(email)) AS normalized_email,
    COUNT(*) AS occurrence_count
FROM customers
GROUP BY LOWER(TRIM(email))
HAVING COUNT(*) > 1
ORDER BY occurrence_count DESC;

Configuration

# config.example.yaml
profiling:
  sample_size: 10000            # Rows to profile (null = all)
  completeness_threshold: 0.95  # Flag columns below this

deduplication:
  strategy: "fuzzy"             # exact | fuzzy | composite
  match_columns: ["email", "phone", "last_name"]
  similarity_threshold: 0.85
  keep: "most_recent"           # most_recent | most_complete | first

standardization:
  dates:
    output_format: "YYYY-MM-DD"
    parse_formats: ["MM/DD/YYYY", "DD-Mon-YY", "YYYY/MM/DD"]
  phones:
    country_code: "+1"
    format: "E.164"             # +15551234567
  text:
    case: "title"               # lower | upper | title
    strip_whitespace: true
    remove_special_chars: false

outliers:
  method: "iqr"                 # iqr | zscore | isolation_forest
  iqr_multiplier: 1.5
  action: "flag"                # flag | cap | remove
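To make the outliers section concrete, here is a minimal sketch of IQR flagging with the default multiplier of 1.5 (the function name and sample data are illustrative, not the kit's API):

```python
import pandas as pd

def flag_iqr_outliers(series: pd.Series, multiplier: float = 1.5) -> pd.Series:
    """Boolean mask marking values outside [Q1 - m*IQR, Q3 + m*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - multiplier * iqr, q3 + multiplier * iqr
    return (series < lower) | (series > upper)

prices = pd.Series([10, 12, 11, 13, 12, 11, 250])
mask = flag_iqr_outliers(prices)  # only the 250 is flagged
```

With action: "flag" the mask would be stored as a new column; "cap" would clip values to the fences, and "remove" would drop the flagged rows, ideally into a quarantine table rather than the void.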

Best Practices

  1. Profile before cleaning — know your data quality baseline before transforming anything
  2. Never delete data silently — flag or quarantine, and keep an audit log
  3. Clean at the source when possible — fix upstream rather than patching downstream
  4. Test cleaning rules on a sample — validate before applying to millions of rows
  5. Document every transformation — future you will thank present you
  6. Version your cleaning configs — track changes to rules the same way you track code
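Practices 2 and 3 can be sketched in a few lines of pandas (the validation rule, column names, and audit fields are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"email": ["a@x.com", "not-an-email", "b@y.com"]})

# A toy validation rule; real email validation is stricter
valid = df["email"].str.contains("@", na=False)

# Quarantine failing rows with a reason instead of silently dropping them
quarantine = df[~valid].assign(reason="invalid_email")
df_clean = df[valid]

# Minimal audit record for the transformation
audit_log = {
    "rule": "email must contain '@'",
    "rows_in": len(df),
    "rows_quarantined": int((~valid).sum()),
}
```

The quarantine table plus the audit record is what lets you answer "where did those 847 rows go?" six months later.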

Troubleshooting

| Issue | Cause | Fix |
| --- | --- | --- |
| Fuzzy matching is too slow | O(n²) pairwise comparison | Use blocking (group by first letter or zip code) to reduce comparisons |
| Too many false-positive duplicates | Threshold too low | Raise similarity_threshold to 0.90+ or add more match columns |
| Imputed values distort analysis | Wrong imputation strategy | Use the playbook's imputation decision tree; consider multiple imputation for statistical analyses |
| Date parsing fails | Ambiguous formats (01/02/03) | Require explicit format specification in the config |
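For the ambiguous-date row, a short pandas sketch shows why explicit formats matter: the same string parses to two different dates depending on the format you force, so the config's parse_formats list removes the guesswork.

```python
import pandas as pd

raw = pd.Series(["01/02/03", "07/04/24"])

as_mdy = pd.to_datetime(raw, format="%m/%d/%y")  # "01/02/03" -> 2003-01-02
as_dmy = pd.to_datetime(raw, format="%d/%m/%y")  # "01/02/03" -> 2003-02-01

# Normalize to the config's YYYY-MM-DD output format
iso = as_mdy.dt.strftime("%Y-%m-%d")
```

Without an explicit format, pandas has to guess, and the guess can silently change between locales or library versions.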

This is 1 of 11 resources in the Data Analyst Toolkit. Get the complete Data Cleaning Playbook with all files, templates, and documentation for $19.

Get the Full Kit →

Or grab the entire Data Analyst Toolkit bundle (11 products) for $129 — save 30%.

Get the Complete Bundle →

