I’ve spent the last few months cleaning spreadsheets — not because I love it (I don’t), but because I was building something that required understanding how bad real-world data actually is.
If you’ve ever worked with CSVs exported from different tools, you already know the pain:
- male, Male, MALE, m
- ₹20,000, 20k, 20000.00, ₹ 20,000/-
- Dates like 12/03/24, 2024-03-12, and 03-12-2024 all in the same sheet
- Category columns where “Electronics”, “Electronic”, “Elec.” all refer to the same thing
- Rows where values literally belong to different columns
- Two sheets that should merge cleanly… but don’t
None of this is new. But when you try automating it, you realize something:
Cleaning data isn’t “just clean the data.”
It’s a chain of dependent decisions.
For context, I’m building a tool called RowTidy — an AI-powered data cleaning system — mostly because of these discoveries.
🔗 https://rowtidy.com
This post is not promotional — it’s the story of what I learned while trying to make sense of messy spreadsheets.
🧩 1. No two spreadsheets are ever the same
Even inside the same organization, I’ve seen:
- Qty, quantity, QTY, QTY., QTY ( )
- Customer Name, Client Name, Name
- Price, Rate, Cost, Total Price
Spreadsheet headers evolve based on:
- who created them
- which department exported them
- which software wrote them
- how much of a hurry the person writing them was in
Spreadsheets reflect human chaos.
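To make that concrete, here's a minimal sketch of one way to canonicalize header variants before any other cleaning happens. The synonym map and the `canonical_header` helper are my own illustration, not RowTidy's actual code:

```python
import re

# Hypothetical synonym map; in practice it grows per organization.
HEADER_SYNONYMS = {
    "qty": "quantity",
    "client name": "customer name",
    "name": "customer name",
    "rate": "price",
    "cost": "price",
}

def canonical_header(raw: str) -> str:
    """Lowercase, strip punctuation, collapse spaces, then map synonyms."""
    key = re.sub(r"[^\w\s]", "", raw).lower()
    key = re.sub(r"\s+", " ", key).strip()
    return HEADER_SYNONYMS.get(key, key)

for header in ["Qty", "QTY.", "Customer Name", "Client Name", "Rate"]:
    print(header, "->", canonical_header(header))
```

A static map like this works right up until the next department invents a new variant, which is exactly what the next section is about.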
🧹 2. Hardcoded rules always break on the next file
You can write rules:
- trim spaces
- fix capitalization
- remove symbols
- normalize dates
- normalize currency
…but then you get a file that uses one of those “dirty” formats intentionally.
Or a system where L means lakhs, but another team uses k.
Or numerical fields that sometimes contain text — but intentionally.
Static rules don’t work.
Dynamic rules do.
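To see why, here's a tiny sketch of a static currency rule failing in exactly this way. The formats and the `naive_normalize_amount` helper are hypothetical:

```python
import re

def naive_normalize_amount(value: str) -> float:
    """Static rule: drop the ₹ symbol, separators, and a trailing '/-'."""
    return float(re.sub(r"[₹,\s]|/-$", "", value))

for raw in ["₹ 20,000/-", "20000.00", "20k", "2L"]:
    try:
        print(raw, "->", naive_normalize_amount(raw))
    except ValueError:
        print(raw, "-> the rule never anticipated this format")
```

You can keep special-casing k and L, but each fix is another static rule waiting to break on the next file.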
🤖 3. LLMs help — but passing the entire sheet is a disaster
If you pass 5,000 rows:
❌ hallucinations
❌ inconsistent corrections
❌ smashed formats
❌ some rows get cleaned differently than others
But if you pass:
- 20–30 representative rows
- column types
- sheet type (Sales Sheet / Product Catalog / Property Listing etc.)
- patterns in the data
LLMs suddenly behave like extremely consistent rule generators.
They say things like:
- “Column Price seems to contain currency in mixed formats. Normalize as plain float with 2 decimals.”
- “Column Category contains variants. Canonicalize based on semantic similarity.”
- “Column BHK in this dataset is numeric but appears in words. Convert consistently.”
Then you take those rules and deterministically apply them across the whole dataset.
This works far better than the naïve “ask LLM to clean the whole sheet.”
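Here's a rough sketch of that flow. The prompt shape, the rule format, and the `apply_action` helper are all my own assumptions, and the model response is hardcoded so the example runs without an API key:

```python
import json
import random

def build_cleaning_prompt(rows, column_types, sheet_type, sample_size=25):
    """Send a small representative sample plus metadata, never the full sheet."""
    sample = random.sample(rows, min(sample_size, len(rows)))
    return (
        f"Sheet type: {sheet_type}\n"
        f"Column types: {json.dumps(column_types)}\n"
        f"Sample rows: {json.dumps(sample)}\n"
        'Propose cleaning rules as JSON: [{"column": ..., "action": ...}]'
    )

rows = [{"Price": "₹ 20,000"}, {"Price": "20000.00"}, {"Price": "₹1,499"}]
prompt = build_cleaning_prompt(rows, {"Price": "currency"}, "Sales Sheet")

# In production you'd send `prompt` to a model; this is the kind of rule
# it proposes:
llm_response = '[{"column": "Price", "action": "normalize_currency"}]'

def apply_action(value, action):
    """Deterministic application -- the LLM never touches the full dataset."""
    if action == "normalize_currency":
        return f"{float(str(value).replace('₹', '').replace(',', '').strip()):.2f}"
    return value

for rule in json.loads(llm_response):
    for row in rows:
        row[rule["column"]] = apply_action(row[rule["column"]], rule["action"])

print(rows)  # every Price is now a plain value with 2 decimals
```

The key design point: the model sees a sample and writes rules; plain code applies them, so row 4,999 gets exactly the same treatment as row 1.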
🔄 4. The hardest part: conflicts & dependencies
Example:
| Name | Gender |
|---|---|
| Ehsan | Male |
| Ehsan | Female |
Which one is correct?
Or:
| Product | Category |
|---|---|
| Mouse | Accessories |
| Mouse | Peripheral |
Or dates like:
- 02/03/24 (DD/MM/YY?)
- 02/03/24 (MM/DD/YY?)
The only reliable way I found:
- Detect potential conflicts
- Ask AI to propose resolutions
- Let the user approve or discard
- After approval, future sheets auto-resolve the same conflict
Human-in-the-loop.
But only when necessary.
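A rough sketch of that loop, under my own assumptions: the conflict store and the `propose_resolution` / `ask_user_to_approve` helpers are illustrative placeholders, not RowTidy's internals.

```python
from collections import defaultdict

def detect_conflicts(rows, key_col, value_col):
    """Flag keys that map to more than one distinct value."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[key_col]].add(row[value_col])
    return {k: v for k, v in seen.items() if len(v) > 1}

rows = [
    {"Product": "Mouse", "Category": "Accessories"},
    {"Product": "Mouse", "Category": "Peripheral"},
]

# Approved resolutions persist, so the same conflict never re-prompts the user.
approved = {("Mouse", frozenset({"Accessories", "Peripheral"})): "Accessories"}

for key, values in detect_conflicts(rows, "Product", "Category").items():
    resolution = approved.get((key, frozenset(values)))
    if resolution is None:
        # New conflict: hypothetical helpers -- ask the model, then the user.
        resolution = ask_user_to_approve(propose_resolution(key, values))
        approved[(key, frozenset(values))] = resolution
    for row in rows:
        if row["Product"] == key:
            row["Category"] = resolution

print(rows)  # both rows now read "Accessories"
```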
🧪 5. Trying to merge two sheets is where everything truly breaks
A friend shared this experience:
“We receive a sheet every hour with 3–4K rows.
It looks similar… but merging it with our master sheet always breaks.”
Why?
Because two seemingly identical sheets may have:
- different casing
- different punctuation
- different abbreviations
- different spacing
- different numeric formats
- different canonical values
The data is “the same” — but not identical.
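Which is why a merge usually needs a normalization pass over the join keys first. Here's a minimal sketch handling the casing, punctuation, and spacing items from the list above (abbreviations and canonical values still need the semantic step from earlier); the `merge_key` name is mine:

```python
import re

def merge_key(value: str) -> str:
    """Collapse casing, punctuation, and spacing so 'same' values compare equal."""
    key = value.strip().lower()
    key = re.sub(r"[^\w\s]", "", key)   # drop punctuation
    key = re.sub(r"\s+", " ", key)      # collapse spacing
    return key

master = {"wireless mouse black": {"sku": "WM-01"}}
incoming = [{"Product": "Wireless  Mouse (Black)", "qty": 4}]

for row in incoming:
    match = master.get(merge_key(row["Product"]))
    print(row["Product"], "->", match)  # matches despite different formatting
```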
This is exactly the kind of problem that made me build RowTidy:
https://rowtidy.com
🎯 What I learned
Data cleaning is not about formulas.
It’s not about regex.
It’s not even about machine learning.
It’s about:
- understanding intent
- catching inconsistency
- normalizing semantics
- detecting dependencies
- handling edge cases without destroying valid data
- doing all this without annoying the user
The more spreadsheets I cleaned, the more I realized:
The problem isn’t that data is messy.
The problem is that every dataset is messy in a different way.
And that’s exactly why automation is both difficult and meaningful.
If you deal with messy CSVs, duplications, conflicting columns, or imports that constantly break — I’d love to hear about your workflow.
Also, if you're curious, you can check what I’m building here:
👉 https://rowtidy.com