I Tried Cleaning 5,000+ Rows of Messy Excel Data… and Realized the Problem Is Bigger Than I Thought

I’ve spent the last few months cleaning spreadsheets — not because I love it (I don’t), but because I was building something that required understanding how bad real-world data actually is.

If you’ve ever worked with CSVs exported from different tools, you already know the pain:

  • male, Male, MALE, m
  • ₹20,000, 20k, 20000.00, ₹ 20,000/-
  • Dates like 12/03/24, 2024-03-12, and 03-12-2024 all in the same sheet
  • Category columns where “Electronics”, “Electronic”, “Elec.” all refer to the same thing
  • Rows where values have landed in the wrong columns
  • Two sheets that should merge cleanly… but don’t

None of this is new. But when you try automating it, you realize something:

Cleaning data isn’t “just clean the data.”

It’s a chain of dependent decisions.

For context, I’m building a tool called RowTidy — an AI-powered data cleaning system — mostly because of these discoveries.

🔗 https://rowtidy.com

This post is not promotional — it’s the story of what I learned while trying to make sense of messy spreadsheets.


🧩 1. No two spreadsheets are ever the same

Even inside the same organization, I’ve seen:

  • Qty, quantity, QTY, QTY. , QTY ( )
  • Customer Name, Client Name, Name
  • Price, Rate, Cost, Total Price

Spreadsheet headers evolve based on:

  • who created them
  • which department exported them
  • which software wrote them
  • how rushed the person writing it was

Spreadsheets reflect human chaos.
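To make this concrete, here's a minimal sketch of the header normalization this pushes you toward. The alias map and the `normalize_header` helper are my own illustration, not any real library:

```python
import re

# Hand-maintained map from canonical name -> known header variants.
# Every new file tends to add another entry, which is exactly the problem.
HEADER_ALIASES = {
    "quantity": {"qty", "quantity", "qty.", "qty ( )"},
    "customer_name": {"customer name", "client name", "name"},
    "price": {"price", "rate", "cost", "total price"},
}

def normalize_header(raw: str) -> str:
    """Map a raw header to its canonical name, if we know it."""
    key = re.sub(r"\s+", " ", raw.strip().lower())
    for canonical, variants in HEADER_ALIASES.items():
        if key in variants:
            return canonical
    return key  # unknown headers pass through untouched

print(normalize_header("  QTY. "))      # quantity
print(normalize_header("Client Name"))  # customer_name
```

The map works right up until the next file arrives with a variant you haven't seen. Which brings me to the next lesson.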


🧹 2. Hardcoded rules always break on the next file

You can write rules:

  • trim spaces
  • fix capitalization
  • remove symbols
  • normalize dates
  • normalize currency

…but then you get a file that uses one of those “dirty” formats intentionally.

Or a system where L means lakhs, while another team's files use k for thousands.

Or numerical fields that sometimes contain text — but intentionally.

Static rules don’t work.

Dynamic rules do.
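For illustration, here's what the static approach looks like as a minimal sketch. Each step is defensible on its own and wrong for somebody's file:

```python
import re

def clean_static(value: str) -> str:
    """A typical hardcoded cleaning pipeline."""
    v = value.strip()              # trim spaces
    v = re.sub(r"[₹$,/-]", "", v)  # remove symbols
    return v.title()               # fix capitalization

# The easy cases work:
print(clean_static("  ₹20,000/- "))  # "20000"
print(clean_static("MALE"))          # "Male"

# The intentional ones get silently mangled:
print(clean_static("20k"))           # "20K" -- the k meant thousands
print(clean_static("iPhone 13"))     # "Iphone 13"
```

"Dynamic" here means the rules get chosen per file, per column, based on what the data actually contains. Which is where LLMs enter the picture.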


🤖 3. LLMs help — but passing the entire sheet is a disaster

If you pass 5,000 rows:

❌ hallucinations

❌ inconsistent corrections

❌ smashed formats

❌ some rows get cleaned differently than others

But if you pass:

  • 20–30 representative rows
  • column types
  • sheet type (Sales Sheet / Product Catalog / Property Listing etc.)
  • patterns in the data

LLMs suddenly behave like extremely consistent rule generators.

They say things like:

  • “Column Price seems to contain currency in mixed formats. Normalize as plain float with 2 decimals.”
  • “Column Category contains variants. Canonicalize based on semantic similarity.”
  • “Column BHK in this dataset is numeric but appears in words. Convert consistently.”

Then you take those rules and deterministically apply them across the whole dataset.

This works far better than the naïve “ask LLM to clean the whole sheet.”
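Here's a sketch of that flow, assuming pandas and a placeholder `call_llm` function (the prompt and the rule schema are mine; swap in whatever model and format you use):

```python
import json
import pandas as pd

def propose_rules(df: pd.DataFrame, sheet_type: str, call_llm) -> list[dict]:
    """Send a small representative sample; get cleaning rules back as JSON."""
    sample = df.sample(min(25, len(df)), random_state=0)
    prompt = (
        f"Sheet type: {sheet_type}\n"
        f"Column dtypes: {df.dtypes.astype(str).to_dict()}\n"
        f"Sample rows:\n{sample.to_csv(index=False)}\n"
        "Return a JSON list of rules, e.g. "
        '[{"column": "Price", "action": "normalize_currency"}]'
    )
    return json.loads(call_llm(prompt))

def apply_rules(df: pd.DataFrame, rules: list[dict]) -> pd.DataFrame:
    """Deterministic application: the LLM never touches the full 5,000 rows."""
    actions = {
        "normalize_currency": lambda s: (
            s.astype(str)
             .str.replace(r"[^\d.]", "", regex=True)
             .astype(float)
             .round(2)
        ),
        "strip_whitespace": lambda s: s.astype(str).str.strip(),
    }
    for rule in rules:
        if rule["action"] in actions:
            df[rule["column"]] = actions[rule["action"]](df[rule["column"]])
    return df
```

The key property: every value in a column gets the exact same transformation, which the "clean the whole sheet" approach can't guarantee.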


🔄 4. The hardest part: conflicts & dependencies

Example:

| Name  | Gender |
| ----- | ------ |
| Ehsan | Male   |
| Ehsan | Female |

Which one is correct?

Or:

| Product | Category    |
| ------- | ----------- |
| Mouse   | Accessories |
| Mouse   | Peripheral  |

Or dates like:

  • 02/03/24 (DD/MM/YY?)
  • 02/03/24 (MM/DD/YY?)
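For dates specifically there's one cheap trick worth trying first: scan the whole column. A first component greater than 12 can't be a month, so it settles the question for every row. A sketch (my own heuristic, not a library call):

```python
def detect_day_first(dates: list[str]) -> bool | None:
    """True = DD/MM, False = MM/DD, None = still ambiguous."""
    firsts = [int(d.split("/")[0]) for d in dates]
    seconds = [int(d.split("/")[1]) for d in dates]
    if any(p > 12 for p in firsts):
        return True
    if any(p > 12 for p in seconds):
        return False
    return None  # every row is ambiguous; a heuristic can't save you

print(detect_day_first(["02/03/24", "15/03/24"]))  # True
print(detect_day_first(["02/03/24", "03/15/24"]))  # False
print(detect_day_first(["02/03/24"]))              # None
```

When everything comes back `None`, you're out of heuristics.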

The only reliable way I found:

  1. Detect potential conflicts
  2. Ask AI to propose resolutions
  3. Let the user approve or discard
  4. After approval, future sheets auto-resolve the same conflict

Human-in-the-loop.

But only when necessary.
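Step 1 is the only purely mechanical part. Here's a pandas sketch of conflict detection, assuming you've already identified a key column and a value column:

```python
import pandas as pd

def find_conflicts(df: pd.DataFrame, key: str, value: str) -> pd.DataFrame:
    """Return rows where the same key maps to more than one value."""
    counts = df.groupby(key)[value].nunique()
    conflicted = counts[counts > 1].index
    return df[df[key].isin(conflicted)].sort_values(key)

df = pd.DataFrame({
    "Name":   ["Ehsan", "Ehsan", "Asha"],
    "Gender": ["Male", "Female", "Female"],
})
print(find_conflicts(df, "Name", "Gender"))
#     Name  Gender
# 0  Ehsan    Male
# 1  Ehsan  Female
```

Steps 2–4 are where the judgment lives, and where approved resolutions turn into deterministic rules for the next file.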


🧪 5. Trying to merge two sheets is where everything truly breaks

A friend shared this experience:

> "We receive a sheet every hour with 3–4K rows.
> It looks similar… but merging it with our master sheet always breaks."

Why?

Because two seemingly identical sheets may have:

  • different casing
  • different punctuation
  • different abbreviations
  • different spacing
  • different numeric formats
  • different canonical values

The data is “the same” — but not identical.
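Joining on raw values silently drops or duplicates rows. What worked for me is canonicalizing both sides first and merging on the canonical key. A sketch (`canonicalize` and `safe_merge` are illustrative names):

```python
import re
import pandas as pd

def canonicalize(s: pd.Series) -> pd.Series:
    """Fold casing, spacing, and trailing punctuation so that
    'Mouse ', 'MOUSE', and 'mouse.' all map to the same key."""
    return (
        s.astype(str)
         .str.strip()
         .str.lower()
         .map(lambda v: re.sub(r"\s+", " ", v))
         .str.rstrip(".,;")
    )

def safe_merge(master: pd.DataFrame, incoming: pd.DataFrame, on: str) -> pd.DataFrame:
    """Merge on a canonical key while keeping the original values."""
    master = master.assign(_key=canonicalize(master[on]))
    incoming = incoming.assign(_key=canonicalize(incoming[on]))
    return master.merge(incoming, on="_key", suffixes=("", "_new")).drop(columns="_key")
```

That handles formatting drift. Semantic drift ("Accessories" vs "Peripheral") still needs the conflict workflow from the previous section.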

This is exactly the kind of problem that made me build RowTidy:
https://rowtidy.com


🎯 What I learned

Data cleaning is not about formulas.

It’s not about regex.

It’s not even about machine learning.

It’s about:

  • understanding intent
  • catching inconsistency
  • normalizing semantics
  • detecting dependencies
  • handling edge cases without destroying valid data
  • doing all this without annoying the user

The more spreadsheets I cleaned, the more I realized:

The problem isn’t that data is messy.

The problem is that every dataset is messy in a different way.

And that’s exactly why automation is both difficult and meaningful.


If you deal with messy CSVs, duplications, conflicting columns, or imports that constantly break — I’d love to hear about your workflow.

Also, if you're curious, you can check what I’m building here:

👉 https://rowtidy.com
