DEV Community

Thesius Code
Thesius Code

Posted on • Originally published at datanest-stores.pages.dev

Data Cleaning & Analysis Templates

Data Cleaning & Analysis Templates

A collection of spreadsheet templates with built-in validation formulas, deduplication tools, data normalization functions, and pivot table templates. Clean messy data without writing code — just paste your data and let the formulas work.

Key Features

  • Validation Engine — 30+ formulas to detect invalid emails, phone numbers, dates, duplicates, and outliers
  • Deduplication Tools — exact match, fuzzy match, and configurable similarity threshold dedup
  • Data Normalization — standardize names, addresses, phone formats, dates, and currency values
  • Outlier Detection — IQR-based and Z-score methods with automatic flagging
  • Pivot Table Templates — 10 pre-built pivot configurations for common analysis patterns
  • Audit Trail — track what was cleaned, when, and what the original value was

What's Included

Sheet Tab Description
Data Input Paste raw data here — validation runs automatically
Validation Report Summary of all detected issues by column and type
Dedup Tool Duplicate finder with configurable match criteria
Normalizer Side-by-side original vs. cleaned values
Outlier Analysis Statistical outlier detection with visual flags
Pivot Templates 10 ready-to-use pivot table configurations
Formula Library Reference sheet with all 30+ validation formulas
Config Validation rules, thresholds, and format settings

Quick Start

  1. Open the Data Input tab and paste your raw dataset (up to 50,000 rows)
  2. Check the Validation Report tab — issues are flagged automatically with color coding
  3. Use the Dedup Tool tab to find and mark duplicates
  4. Run data through the Normalizer tab for standardized output
  5. Review Outlier Analysis for statistical anomalies
  6. Export the cleaned data from the Normalizer output columns

Formula Examples

Email Validation

=AND(
  ISERROR(FIND(" ", A2)),
  LEN(A2) - LEN(SUBSTITUTE(A2, "@", "")) = 1,
  FIND("@", A2) > 1,
  FIND(".", A2, FIND("@", A2)) > FIND("@", A2) + 1,
  LEN(A2) - FIND(".", A2, FIND("@", A2)) >= 2
)
Enter fullscreen mode Exit fullscreen mode

Duplicate Detection (Exact Match)

=IF(COUNTIF($A$2:$A2, A2) > 1, "DUPLICATE", "UNIQUE")
Enter fullscreen mode Exit fullscreen mode

Fuzzy Duplicate Detection (Similarity Score)

=1 - (ABS(LEN(A2)-LEN(B2)) + SUMPRODUCT((MID(UPPER(A2),ROW(INDIRECT("1:"&LEN(A2))),1)<>MID(UPPER(B2),ROW(INDIRECT("1:"&LEN(B2))),1))*1)) / MAX(LEN(A2),LEN(B2))
Enter fullscreen mode Exit fullscreen mode

Phone Number Normalization (US Format)

=IF(LEN(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(A2,"-",""),"(",""),")","")," ",""))=10,
  "("&LEFT(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(A2,"-",""),"(",""),")","")," ",""),3)&") "&MID(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(A2,"-",""),"(",""),")","")," ",""),4,3)&"-"&RIGHT(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(A2,"-",""),"(",""),")","")," ",""),4),
  "INVALID"
)
Enter fullscreen mode Exit fullscreen mode

Outlier Detection (IQR Method)

=IF(OR(
  A2 < QUARTILE($A$2:$A$1000,1) - 1.5*(QUARTILE($A$2:$A$1000,3)-QUARTILE($A$2:$A$1000,1)),
  A2 > QUARTILE($A$2:$A$1000,3) + 1.5*(QUARTILE($A$2:$A$1000,3)-QUARTILE($A$2:$A$1000,1))
), "OUTLIER", "OK")
Enter fullscreen mode Exit fullscreen mode

Date Standardization (to YYYY-MM-DD)

=TEXT(DATEVALUE(SUBSTITUTE(SUBSTITUTE(A2,"/","-"),".","-")), "YYYY-MM-DD")
Enter fullscreen mode Exit fullscreen mode

Name Case Normalization

=PROPER(TRIM(CLEAN(SUBSTITUTE(A2, CHAR(160), " "))))
Enter fullscreen mode Exit fullscreen mode

Whitespace Cleanup

=TRIM(CLEAN(SUBSTITUTE(A2, CHAR(160), " ")))
Enter fullscreen mode Exit fullscreen mode
  • Removes leading/trailing spaces, non-printable characters, and non-breaking spaces

Numeric Extraction from Mixed Text

=SUMPRODUCT(MID(0&A2, LARGE(INDEX(ISNUMBER(--MID(A2,ROW(INDIRECT("1:"&LEN(A2))),1))*ROW(INDIRECT("1:"&LEN(A2))),0),ROW(INDIRECT("1:"&LEN(A2))))+1,1)*10^ROW(INDIRECT("1:"&LEN(A2)))/10)
Enter fullscreen mode Exit fullscreen mode
  • Extracts "12345" from "Order #12345-A"

Validation Summary Count

=COUNTIF(ValidationReport!$C:$C, "INVALID")
Enter fullscreen mode Exit fullscreen mode

Usage Tips

  • Color Coding: Red = invalid/must fix, Yellow = suspicious/review, Green = valid
  • Threshold Config: Adjust similarity thresholds in the Config tab (default: 85% for fuzzy match)
  • Large Datasets: For 10,000+ rows, copy formulas in batches of 1,000 to avoid slowdowns
  • Preserve Originals: The Normalizer always keeps original values in column A; cleaned values appear in column B
  • Custom Rules: Add your own validation regex patterns in the Config tab row 20+

Best Practices

  1. Always keep a backup of your raw data before cleaning
  2. Validate first, clean second — understand the scope of issues before transforming
  3. Handle blanks explicitly — decide whether empty cells mean "unknown" or "not applicable"
  4. Document your cleaning rules — use the Audit Trail tab to record decisions
  5. Spot-check results — randomly sample 5% of cleaned rows to verify accuracy

This is 1 of 11 resources in the Spreadsheet Tools Pro toolkit. Get the complete [Data Cleaning & Analysis Templates] with all files, templates, and documentation for $29.

Get the Full Kit →

Or grab the entire Spreadsheet Tools Pro bundle (11 products) for $149 — save 30%.

Get the Complete Bundle →


Related Articles

Top comments (0)