I Tried Cleaning 5,000+ Rows of Messy Excel Data… and Realized the Problem Is Bigger Than I Thought

I’ve spent the last few months cleaning spreadsheets — not because I love it (I don’t), but because I was building something that required understanding how bad real-world data actually is.

If you’ve ever worked with CSVs exported from different tools, you already know the pain:

  • male, Male, MALE, m
  • ₹20,000, 20k, 20000.00, ₹ 20,000/-
  • Dates like 12/03/24, 2024-03-12, and 03-12-2024 all in the same sheet
  • Category columns where “Electronics”, “Electronic”, “Elec.” all refer to the same thing
  • Rows where values have landed in the wrong columns
  • Two sheets that should merge cleanly… but don’t

None of this is new. But when you try automating it, you realize something:

Cleaning data isn’t “just clean the data.”

It’s a chain of dependent decisions.

For context, I’m building a tool called RowTidy — an AI-powered data cleaning system — mostly because of these discoveries.

🔗 https://rowtidy.com

This post is not promotional — it’s the story of what I learned while trying to make sense of messy spreadsheets.


🧩 1. No two spreadsheets are ever the same

Even inside the same organization, I’ve seen:

  • Qty, quantity, QTY, QTY. , QTY ( )
  • Customer Name, Client Name, Name
  • Price, Rate, Cost, Total Price

Spreadsheet headers evolve based on:

  • who created them
  • which department exported them
  • which software wrote them
  • how rushed the person writing it was

Spreadsheets reflect human chaos.
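To make this concrete, here's a minimal sketch of the header normalization this pushes you toward. The alias map and the `normalize_header` helper are my own illustration, not any real library:

```python
import re

# Hand-maintained map from canonical name -> known header variants.
# Every new file tends to add another entry, which is exactly the problem.
HEADER_ALIASES = {
    "quantity": {"qty", "quantity", "qty.", "qty ( )"},
    "customer_name": {"customer name", "client name", "name"},
    "price": {"price", "rate", "cost", "total price"},
}

def normalize_header(raw: str) -> str:
    """Map a raw header to its canonical name, if we know it."""
    key = re.sub(r"\s+", " ", raw.strip().lower())
    for canonical, variants in HEADER_ALIASES.items():
        if key in variants:
            return canonical
    return key  # unknown headers pass through untouched

print(normalize_header("  QTY. "))      # quantity
print(normalize_header("Client Name"))  # customer_name
```

The map works right up until the next file arrives with a variant you haven't seen. Which brings me to the next lesson.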


🧹 2. Hardcoded rules always break on the next file

You can write rules:

  • trim spaces
  • fix capitalization
  • remove symbols
  • normalize dates
  • normalize currency

…but then you get a file that uses one of those “dirty” formats intentionally.

Or a system where L means lakhs, while another team's files use k for thousands.

Or numerical fields that sometimes contain text — but intentionally.

Static rules don’t work.

Dynamic rules do.
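For illustration, here's what the static approach looks like as a minimal sketch. Each step is defensible on its own and wrong for somebody's file:

```python
import re

def clean_static(value: str) -> str:
    """A typical hardcoded cleaning pipeline."""
    v = value.strip()              # trim spaces
    v = re.sub(r"[₹$,/-]", "", v)  # remove symbols
    return v.title()               # fix capitalization

# The easy cases work:
print(clean_static("  ₹20,000/- "))  # "20000"
print(clean_static("MALE"))          # "Male"

# The intentional ones get silently mangled:
print(clean_static("20k"))           # "20K" -- the k meant thousands
print(clean_static("iPhone 13"))     # "Iphone 13"
```

"Dynamic" here means the rules get chosen per file, per column, based on what the data actually contains. Which is where LLMs enter the picture.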


🤖 3. LLMs help — but passing the entire sheet is a disaster

If you pass 5,000 rows:

❌ hallucinations

❌ inconsistent corrections

❌ smashed formats

❌ some rows get cleaned differently than others

But if you pass:

  • 20–30 representative rows
  • column types
  • sheet type (Sales Sheet / Product Catalog / Property Listing etc.)
  • patterns in the data

LLMs suddenly behave like extremely consistent rule generators.

They say things like:

  • “Column Price seems to contain currency in mixed formats. Normalize as plain float with 2 decimals.”
  • “Column Category contains variants. Canonicalize based on semantic similarity.”
  • “Column BHK in this dataset is numeric but appears in words. Convert consistently.”

Then you take those rules and deterministically apply them across the whole dataset.

This works far better than the naïve “ask LLM to clean the whole sheet.”
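Here's a sketch of that flow, assuming pandas and a placeholder `call_llm` function (the prompt and the rule schema are mine; swap in whatever model and format you use):

```python
import json
import pandas as pd

def propose_rules(df: pd.DataFrame, sheet_type: str, call_llm) -> list[dict]:
    """Send a small representative sample; get cleaning rules back as JSON."""
    sample = df.sample(min(25, len(df)), random_state=0)
    prompt = (
        f"Sheet type: {sheet_type}\n"
        f"Column dtypes: {df.dtypes.astype(str).to_dict()}\n"
        f"Sample rows:\n{sample.to_csv(index=False)}\n"
        "Return a JSON list of rules, e.g. "
        '[{"column": "Price", "action": "normalize_currency"}]'
    )
    return json.loads(call_llm(prompt))

def apply_rules(df: pd.DataFrame, rules: list[dict]) -> pd.DataFrame:
    """Deterministic application: the LLM never touches the full 5,000 rows."""
    actions = {
        "normalize_currency": lambda s: (
            s.astype(str)
             .str.replace(r"[^\d.]", "", regex=True)
             .astype(float)
             .round(2)
        ),
        "strip_whitespace": lambda s: s.astype(str).str.strip(),
    }
    for rule in rules:
        if rule["action"] in actions:
            df[rule["column"]] = actions[rule["action"]](df[rule["column"]])
    return df
```

The key property: every value in a column gets the exact same transformation, which the "clean the whole sheet" approach can't guarantee.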


🔄 4. The hardest part: conflicts & dependencies

Example:

| Name  | Gender |
| ----- | ------ |
| Ehsan | Male   |
| Ehsan | Female |

Which one is correct?

Or:

| Product | Category    |
| ------- | ----------- |
| Mouse   | Accessories |
| Mouse   | Peripheral  |

Or dates like:

  • 02/03/24 (DD/MM/YY?)
  • 02/03/24 (MM/DD/YY?)
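For dates specifically there's one cheap trick worth trying first: scan the whole column. A first component greater than 12 can't be a month, so it settles the question for every row. A sketch (my own heuristic, not a library call):

```python
def detect_day_first(dates: list[str]) -> bool | None:
    """True = DD/MM, False = MM/DD, None = still ambiguous."""
    firsts = [int(d.split("/")[0]) for d in dates]
    seconds = [int(d.split("/")[1]) for d in dates]
    if any(p > 12 for p in firsts):
        return True
    if any(p > 12 for p in seconds):
        return False
    return None  # every row is ambiguous; a heuristic can't save you

print(detect_day_first(["02/03/24", "15/03/24"]))  # True
print(detect_day_first(["02/03/24", "03/15/24"]))  # False
print(detect_day_first(["02/03/24"]))              # None
```

When everything comes back `None`, you're out of heuristics.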

The only reliable way I found:

  1. Detect potential conflicts
  2. Ask AI to propose resolutions
  3. Let the user approve or discard
  4. After approval, future sheets auto-resolve the same conflict

Human-in-the-loop.

But only when necessary.
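Step 1 is the only purely mechanical part. Here's a pandas sketch of conflict detection, assuming you've already identified a key column and a value column:

```python
import pandas as pd

def find_conflicts(df: pd.DataFrame, key: str, value: str) -> pd.DataFrame:
    """Return rows where the same key maps to more than one value."""
    counts = df.groupby(key)[value].nunique()
    conflicted = counts[counts > 1].index
    return df[df[key].isin(conflicted)].sort_values(key)

df = pd.DataFrame({
    "Name":   ["Ehsan", "Ehsan", "Asha"],
    "Gender": ["Male", "Female", "Female"],
})
print(find_conflicts(df, "Name", "Gender"))
#     Name  Gender
# 0  Ehsan    Male
# 1  Ehsan  Female
```

Steps 2–4 are where the judgment lives, and where approved resolutions turn into deterministic rules for the next file.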


🧪 5. Trying to merge two sheets is where everything truly breaks

A friend shared this experience:

> "We receive a sheet every hour with 3–4K rows.
> It looks similar… but merging it with our master sheet always breaks."

Why?

Because two seemingly identical sheets may have:

  • different casing
  • different punctuation
  • different abbreviations
  • different spacing
  • different numeric formats
  • different canonical values

The data is “the same” — but not identical.
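Joining on raw values silently drops or duplicates rows. What worked for me is canonicalizing both sides first and merging on the canonical key. A sketch (`canonicalize` and `safe_merge` are illustrative names):

```python
import re
import pandas as pd

def canonicalize(s: pd.Series) -> pd.Series:
    """Fold casing, spacing, and trailing punctuation so that
    'Mouse ', 'MOUSE', and 'mouse.' all map to the same key."""
    return (
        s.astype(str)
         .str.strip()
         .str.lower()
         .map(lambda v: re.sub(r"\s+", " ", v))
         .str.rstrip(".,;")
    )

def safe_merge(master: pd.DataFrame, incoming: pd.DataFrame, on: str) -> pd.DataFrame:
    """Merge on a canonical key while keeping the original values."""
    master = master.assign(_key=canonicalize(master[on]))
    incoming = incoming.assign(_key=canonicalize(incoming[on]))
    return master.merge(incoming, on="_key", suffixes=("", "_new")).drop(columns="_key")
```

That handles formatting drift. Semantic drift ("Accessories" vs "Peripheral") still needs the conflict workflow from the previous section.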

This is exactly the kind of problem that made me build RowTidy:
https://rowtidy.com


🎯 What I learned

Data cleaning is not about formulas.

It’s not about regex.

It’s not even about machine learning.

It’s about:

  • understanding intent
  • catching inconsistency
  • normalizing semantics
  • detecting dependencies
  • handling edge cases without destroying valid data
  • doing all this without annoying the user

The more spreadsheets I cleaned, the more I realized:

The problem isn’t that data is messy.

The problem is that every dataset is messy in a different way.

And that’s exactly why automation is both difficult and meaningful.


If you deal with messy CSVs, duplications, conflicting columns, or imports that constantly break — I’d love to hear about your workflow.

Also, if you're curious, you can check what I’m building here:

👉 https://rowtidy.com
