When most people imagine Data Science, they picture building shiny machine learning models or crafting beautiful dashboards.
But here’s the truth: if your data is messy, no model can save you.
That’s why people say:
👉 “80% of a data scientist’s time is spent cleaning data.”
And it’s not a joke.
Common Data Cleaning Nightmares
- Missing values → blank cells, `NaN`s, or inconsistent entries.
- Duplicates → one user recorded multiple times.
- Outliers → extreme values like a salary of $1,000,000,000.
- Inconsistent categories → `Nairobi`, `NBI`, `254-Nairobi` treated as different cities.
- Wrong data types → numbers stored as text (see the type-conversion sketch after the output below).
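Before fixing anything, it helps to audit how bad things are. A minimal sketch, assuming `df` is whatever raw DataFrame you've loaded (the example dataset itself is built in the next section):

```python
import pandas as pd

# assuming df is your raw DataFrame
print(df.isna().sum())            # missing values per column
print(df.duplicated().sum())      # count of fully duplicated rows
print(df.dtypes)                  # spot numbers stored as object/text
print(df["City"].value_counts())  # reveals "ghost" categories like NBI vs Nairobi
```

Five minutes of auditing like this tells you which of the nightmares above you actually have before you start rewriting anything.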
🛠 Practical Cleaning in Python (with Pandas)
Here’s how I usually tackle some of these:
```python
import pandas as pd

# Example dataset
data = {
    "Name": ["Alice", "Bob", "Charlie", "Alice"],
    "Age": [25, None, 37, 25],
    "Salary": [50000, 60000, 1000000000, 50000],  # notice the outlier
    "City": ["Nairobi", "NBI", "254-Nairobi", "Nairobi"],
}
df = pd.DataFrame(data)

print("Before Cleaning:")
print(df)

# 1. Handle missing values (fill Age with the median)
df["Age"] = df["Age"].fillna(df["Age"].median())

# 2. Drop duplicate rows
df = df.drop_duplicates()

# 3. Fix inconsistent categories
df["City"] = df["City"].replace({"NBI": "Nairobi", "254-Nairobi": "Nairobi"})

# 4. Handle outliers (example: cap salaries above 200k)
df["Salary"] = df["Salary"].clip(upper=200_000)

print("\nAfter Cleaning:")
print(df)
```
✅ Output

```
Before Cleaning:
      Name   Age      Salary         City
0    Alice  25.0       50000      Nairobi
1      Bob   NaN       60000          NBI
2  Charlie  37.0  1000000000  254-Nairobi
3    Alice  25.0       50000      Nairobi

After Cleaning:
      Name   Age  Salary     City
0    Alice  25.0   50000  Nairobi
1      Bob  25.0   60000  Nairobi
2  Charlie  37.0  200000  Nairobi
```
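One nightmare from the list that the walkthrough above doesn't cover is wrong data types. A minimal sketch, using a hypothetical column of salaries that arrived as text:

```python
import pandas as pd

# hypothetical: salaries arrived as strings, some with stray characters
raw = pd.Series(["50000", "60,000", "N/A", "70000"])

# strip thousands separators, then coerce anything unparseable to NaN
salaries = pd.to_numeric(raw.str.replace(",", "", regex=False), errors="coerce")
print(salaries.dtype)  # float64, so you can actually do math on it now
```

The `errors="coerce"` part is the useful bit: bad entries become `NaN` instead of crashing the conversion, so they flow straight into your missing-value handling from step 1.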
Key Takeaways
- Don’t ignore missing values: understand the *why* before filling or dropping.
- Standardize categories early to avoid “ghost” groups.
- Outliers may be errors or real rare events; investigate before deleting (see the sketch below).
- A clean dataset can outperform a complex model trained on dirty data.
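On that outlier point: capping at a hand-picked 200k, as in step 4 above, is quick but arbitrary. A sketch of a more data-driven alternative using the standard IQR rule; this is a generic technique, not part of the walkthrough above:

```python
import pandas as pd

salaries = pd.Series([50000, 60000, 1000000000, 50000, 55000, 62000])

# IQR rule: flag anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = salaries.quantile(0.25), salaries.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = salaries[(salaries < lower) | (salaries > upper)]
print(outliers)  # flags the billion-dollar salary for investigation
```

Flagging first keeps the decision (drop, cap, or keep) explicit instead of baked silently into the cleaning code.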