Data Cleaning Playbook
Production-tested frameworks for data quality: deduplication, standardization, outlier detection, missing-value strategies, and validation rules. Every technique comes with a Python implementation, a SQL alternative, and decision criteria for choosing between approaches.
Key Features
- Deduplication Engine — fuzzy matching, exact matching, and composite key strategies
- Standardization Recipes — names, addresses, dates, phone numbers, currencies
- Outlier Detection — IQR, Z-score, isolation forest, and domain-specific rules
- Missing Value Strategies — imputation decision tree with 8 techniques and when to use each
- Validation Rule Library — 40+ reusable validation patterns for common data types
- Data Profiling Scripts — automated data quality reports with completeness, uniqueness, and distribution stats
- Before/After Audit Trail — track every transformation for reproducibility
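The full imputation decision tree ships with the playbook; as a flavor of its simplest branch, numeric columns might take the median and categorical columns the mode. A standard-library sketch (`impute_column` is an illustrative name, not the playbook's API):

```python
from statistics import median, mode

def impute_column(values: list, numeric: bool) -> list:
    """Fill None entries: median for numeric columns, mode for categorical.
    One simple branch of a fuller imputation decision tree."""
    present = [v for v in values if v is not None]
    fill = median(present) if numeric else mode(present)
    return [fill if v is None else v for v in values]

print(impute_column([34, None, 29, 41, None], numeric=True))      # [34, 34, 29, 41, 34]
print(impute_column(["red", None, "red", "blue"], numeric=False)) # mode fill: "red"
```

Median and mode are only sensible defaults; skewed distributions, time series, and data missing not-at-random each call for different techniques.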
Quick Start
from src.profiler import profile_dataframe
from src.dedup import deduplicate
from src.cleaning import clean_pipeline
# 1. Profile your data to identify issues
report = profile_dataframe(df)
print(report.summary())
# Output: 12 columns, 45,231 rows, 3.2% missing, 847 duplicates
# 2. Deduplicate
df_deduped = deduplicate(
    df,
    match_columns=["email", "last_name"],
    strategy="fuzzy",  # "exact" | "fuzzy" | "composite"
    similarity_threshold=0.85,
)
# 3. Run the full cleaning pipeline
df_clean = clean_pipeline(df_deduped, config="config.example.yaml")
Usage Examples
Fuzzy Deduplication in Python
from difflib import SequenceMatcher

def fuzzy_match_score(s1: str, s2: str) -> float:
    """Calculate similarity ratio between two strings."""
    return SequenceMatcher(None, s1.lower().strip(), s2.lower().strip()).ratio()

def find_duplicates(records: list[dict], key: str,
                    threshold: float = 0.85) -> list[tuple[int, int, float]]:
    """Find fuzzy duplicate pairs above similarity threshold."""
    matches = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            score = fuzzy_match_score(records[i][key], records[j][key])
            if score >= threshold:
                matches.append((i, j, score))
    return matches
# Example: find near-duplicate company names
companies = [
    {"name": "Acme Corporation"},
    {"name": "Acme Corp."},
    {"name": "ACME Corp"},
    {"name": "Beta Industries"},
]
# SequenceMatcher scores run lower than token-based matchers, so the
# threshold is set below the 0.85 default to catch all three Acme variants
dupes = find_duplicates(companies, "name", threshold=0.65)
# [(0, 1, 0.69), (0, 2, 0.72), (1, 2, 0.95)]  (scores rounded)
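The nested loop above compares every pair, which is O(n²) and becomes slow past a few tens of thousands of rows. A common remedy is blocking: bucket records by a cheap key (first character, zip code) and only compare within buckets. A minimal sketch blocking on the first character (`find_duplicates_blocked` is an illustrative name, not part of the playbook's API):

```python
from collections import defaultdict
from difflib import SequenceMatcher

def fuzzy_score(s1: str, s2: str) -> float:
    """Similarity ratio between two normalized strings."""
    return SequenceMatcher(None, s1.lower().strip(), s2.lower().strip()).ratio()

def find_duplicates_blocked(records: list[dict], key: str,
                            threshold: float = 0.85) -> list[tuple[int, int, float]]:
    """Fuzzy-match only within blocks of records sharing a cheap key
    (first character here), avoiding the full O(n^2) pairwise scan."""
    blocks: dict[str, list[int]] = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[rec[key].strip().lower()[:1]].append(idx)
    matches = []
    for indices in blocks.values():
        for a, i in enumerate(indices):
            for j in indices[a + 1:]:
                score = fuzzy_score(records[i][key], records[j][key])
                if score >= threshold:
                    matches.append((i, j, score))
    return matches

companies = [
    {"name": "Acme Corporation"},
    {"name": "Acme Corp."},
    {"name": "ACME Corp"},
    {"name": "Beta Industries"},
]
# "Beta Industries" lands in its own block, so it is never compared
dupes = find_duplicates_blocked(companies, "name")
```

The trade-off: records whose blocking keys differ (a typo in the first character, say) can never match, so choose a key that is itself robust to the errors you expect.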
SQL: Identifying Duplicates with Window Functions
-- Find duplicate records by email, keep the most recent
WITH ranked AS (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY LOWER(TRIM(email))
            ORDER BY updated_at DESC
        ) AS rn
    FROM customers
)
SELECT * FROM ranked WHERE rn = 1;  -- Keep winners

-- Audit: count occurrences per normalized email
SELECT
    LOWER(TRIM(email)) AS normalized_email,
    COUNT(*) AS occurrence_count
FROM customers
GROUP BY LOWER(TRIM(email))
HAVING COUNT(*) > 1
ORDER BY occurrence_count DESC;
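The same keep-the-most-recent rule translates directly to application code. A minimal pure-Python sketch, assuming each row carries `email` and a sortable `updated_at` (field names mirror the SQL above; `keep_most_recent` is an illustrative name):

```python
def keep_most_recent(rows: list[dict]) -> list[dict]:
    """One surviving row per normalized email, preferring the latest
    updated_at, mirroring the ROW_NUMBER() query above."""
    winners: dict[str, dict] = {}
    for row in rows:
        key = row["email"].strip().lower()
        if key not in winners or row["updated_at"] > winners[key]["updated_at"]:
            winners[key] = row
    return list(winners.values())

rows = [
    {"email": "a@x.com",  "updated_at": "2024-01-05"},
    {"email": "A@X.COM ", "updated_at": "2024-03-01"},  # same person, later edit
    {"email": "b@x.com",  "updated_at": "2024-02-10"},
]
survivors = keep_most_recent(rows)  # keeps the 2024-03-01 row and b@x.com
```

ISO-8601 date strings compare correctly as plain strings, which is what makes the `>` comparison safe here; with other formats, parse to datetime first.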
Configuration
# config.example.yaml
profiling:
  sample_size: 10000             # Rows to profile (null = all)
  completeness_threshold: 0.95   # Flag columns below this

deduplication:
  strategy: "fuzzy"              # exact | fuzzy | composite
  match_columns: ["email", "phone", "last_name"]
  similarity_threshold: 0.85
  keep: "most_recent"            # most_recent | most_complete | first

standardization:
  dates:
    output_format: "YYYY-MM-DD"
    parse_formats: ["MM/DD/YYYY", "DD-Mon-YY", "YYYY/MM/DD"]
  phones:
    country_code: "+1"
    format: "E.164"              # +15551234567
  text:
    case: "title"                # lower | upper | title
    strip_whitespace: true
    remove_special_chars: false

outliers:
  method: "iqr"                  # iqr | zscore | isolation_forest
  iqr_multiplier: 1.5
  action: "flag"                 # flag | cap | remove
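The `outliers` block above maps onto Tukey's fences: values outside [Q1 - k·IQR, Q3 + k·IQR] are outliers, with k being `iqr_multiplier`. A standard-library sketch of the `flag` action (`flag_iqr_outliers` is an illustrative name):

```python
from statistics import quantiles

def flag_iqr_outliers(values: list[float], multiplier: float = 1.5) -> list[bool]:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, _, q3 = quantiles(values, n=4)  # exclusive method by default
    iqr = q3 - q1
    lo, hi = q1 - multiplier * iqr, q3 + multiplier * iqr
    return [v < lo or v > hi for v in values]

prices = [10, 12, 11, 13, 12, 11, 250]
print(flag_iqr_outliers(prices))  # only the 250 is flagged
```

The `cap` action would clip flagged values to the fences instead, and `remove` would drop the rows; flagging is the safest default because it loses nothing.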
Best Practices
- Profile before cleaning — know your data quality baseline before transforming anything
- Never delete data silently — flag or quarantine, and keep an audit log
- Clean at the source when possible — fix upstream rather than patching downstream
- Test cleaning rules on a sample — validate before applying to millions of rows
- Document every transformation — future you will thank present you
- Version your cleaning configs — track changes to rules the same way you track code
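The audit-trail and documentation practices can be enforced mechanically rather than by discipline. One lightweight pattern: wrap each cleaning step in a decorator that records rows in, rows out, and a timestamp. A sketch (all names here are illustrative, not the playbook's actual audit API):

```python
from datetime import datetime, timezone

audit_log: list[dict] = []

def audited(step_name: str):
    """Record rows in/out and a timestamp for every call to a cleaning step."""
    def wrap(fn):
        def inner(rows, *args, **kwargs):
            result = fn(rows, *args, **kwargs)
            audit_log.append({
                "step": step_name,
                "rows_in": len(rows),
                "rows_out": len(result),
                "at": datetime.now(timezone.utc).isoformat(),
            })
            return result
        return inner
    return wrap

@audited("drop_blank_emails")
def drop_blank_emails(rows: list[dict]) -> list[dict]:
    return [r for r in rows if r.get("email", "").strip()]

cleaned = drop_blank_emails([{"email": "a@x.com"}, {"email": "   "}])
# audit_log now records step="drop_blank_emails", rows_in=2, rows_out=1
```

Persisting `audit_log` alongside the output dataset gives you the before/after trail with no extra effort per step.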
Troubleshooting
| Issue | Cause | Fix |
|---|---|---|
| Fuzzy matching is too slow | O(n²) pairwise comparison | Use blocking (group by first letter or zip code) to reduce comparisons |
| Too many false positive duplicates | Threshold too low | Increase similarity_threshold to 0.90+ or add more match columns |
| Imputed values distort analysis | Wrong imputation strategy | Use the decision tree above; consider multiple imputation for statistical analyses |
| Date parsing fails | Ambiguous formats (01/02/03) | Require explicit format specification in config |
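The last row deserves emphasis: "01/02/03" is a legal date under at least three conventions, so any parser that guesses will silently corrupt some rows. Requiring one explicit format per source makes the ambiguity visible:

```python
from datetime import datetime

raw = "01/02/03"
# The same string parses cleanly under three formats, with three meanings:
print(datetime.strptime(raw, "%m/%d/%y").date())  # 2003-01-02 (US)
print(datetime.strptime(raw, "%d/%m/%y").date())  # 2003-02-01 (EU)
print(datetime.strptime(raw, "%y/%m/%d").date())  # 2001-02-03 (ISO-like)

# Strict parsing: one declared format per source; ValueError instead of a guess
def parse_strict(value: str, fmt: str) -> datetime:
    return datetime.strptime(value, fmt)
```

A value that fails its declared format should be quarantined for review, not coerced.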
This is 1 of 11 resources in the Data Analyst Toolkit. Get the complete Data Cleaning Playbook with all files, templates, and documentation for $19.
Or grab the entire Data Analyst Toolkit bundle (11 products) for $129 and save 30%.