When most people imagine Data Science, they picture building shiny machine learning models or crafting beautiful dashboards.
But here’s the truth: if your data is messy, no model can save you.
That’s why people say:
👉 “80% of a data scientist’s time is spent cleaning data.”
And it’s not a joke.
Common Data Cleaning Nightmares
- Missing values → blank cells, `NaN`s, or inconsistent entries.
- Duplicates → one user recorded multiple times.
- Outliers → extreme values like a salary of $1,000,000,000.
- Inconsistent categories → `Nairobi`, `NBI`, `254-Nairobi` treated as different cities.
- Wrong data types → numbers stored as text (see the type-conversion sketch after the output below).
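Before fixing anything, it helps to audit how bad things are. A minimal sketch, assuming `df` is whatever raw DataFrame you've loaded (the example dataset itself is built in the next section):

```python
import pandas as pd

# assuming df is your raw DataFrame
print(df.isna().sum())            # missing values per column
print(df.duplicated().sum())      # count of fully duplicated rows
print(df.dtypes)                  # spot numbers stored as object/text
print(df["City"].value_counts())  # reveals "ghost" categories like NBI vs Nairobi
```

Five minutes of auditing like this tells you which of the nightmares above you actually have before you start rewriting anything.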
🛠 Practical Cleaning in Python (with Pandas)
Here’s how I usually tackle some of these:
```python
import pandas as pd

# Example dataset
data = {
    "Name": ["Alice", "Bob", "Charlie", "Alice"],
    "Age": [25, None, 37, 25],
    "Salary": [50000, 60000, 1000000000, 50000],  # notice the outlier
    "City": ["Nairobi", "NBI", "254-Nairobi", "Nairobi"],
}
df = pd.DataFrame(data)

print("Before Cleaning:")
print(df)

# 1. Handle missing values (fill Age with the median)
df["Age"] = df["Age"].fillna(df["Age"].median())

# 2. Drop duplicate rows
df = df.drop_duplicates()

# 3. Fix inconsistent categories
df["City"] = df["City"].replace({"NBI": "Nairobi", "254-Nairobi": "Nairobi"})

# 4. Handle outliers (example: cap salaries above 200k)
df["Salary"] = df["Salary"].clip(upper=200_000)

print("\nAfter Cleaning:")
print(df)
```
✅ Output

```
Before Cleaning:
      Name   Age      Salary         City
0    Alice  25.0       50000      Nairobi
1      Bob   NaN       60000          NBI
2  Charlie  37.0  1000000000  254-Nairobi
3    Alice  25.0       50000      Nairobi

After Cleaning:
      Name   Age  Salary     City
0    Alice  25.0   50000  Nairobi
1      Bob  25.0   60000  Nairobi
2  Charlie  37.0  200000  Nairobi
```
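One nightmare from the list that the walkthrough above doesn't cover is wrong data types. A minimal sketch, using a hypothetical column of salaries that arrived as text:

```python
import pandas as pd

# hypothetical: salaries arrived as strings, some with stray characters
raw = pd.Series(["50000", "60,000", "N/A", "70000"])

# strip thousands separators, then coerce anything unparseable to NaN
salaries = pd.to_numeric(raw.str.replace(",", "", regex=False), errors="coerce")
print(salaries.dtype)  # float64, so you can actually do math on it now
```

The `errors="coerce"` part is the useful bit: bad entries become `NaN` instead of crashing the conversion, so they flow straight into your missing-value handling from step 1.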
Key Takeaways
- Don’t ignore missing values: understand the *why* before filling or dropping.
- Standardize categories early to avoid “ghost” groups.
- Outliers may be errors or real rare events; investigate before deleting (see the sketch below).
- A clean dataset can outperform a complex model trained on dirty data.
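On that outlier point: capping at a hand-picked 200k, as in step 4 above, is quick but arbitrary. A sketch of a more data-driven alternative using the standard IQR rule; this is a generic technique, not part of the walkthrough above:

```python
import pandas as pd

salaries = pd.Series([50000, 60000, 1000000000, 50000, 55000, 62000])

# IQR rule: flag anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = salaries.quantile(0.25), salaries.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = salaries[(salaries < lower) | (salaries > upper)]
print(outliers)  # flags the billion-dollar salary for investigation
```

Flagging first keeps the decision (drop, cap, or keep) explicit instead of baked silently into the cleaning code.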