Welcome to Day 1 of the Statistics Challenge for Data Scientists.
In this series, we’ll cover practical and essential statistical concepts that every data scientist should master.
Today’s topic focuses on one of the most common challenges in data preprocessing — imputing missing values.
Simplest Method to Impute Missing Values
| Data Type | Method | When to Use |
|---|---|---|
| Numerical | Mean / Median | Use mean when data is roughly symmetric (e.g., normally distributed) with no extreme values. Use median when data is skewed or has outliers. |
| Categorical | Mode | Use mode (most frequent value) to fill missing categories. |
Example:
For a numerical column like Age, use median if there are extreme values (outliers).
For categorical data like Region or Category, use mode to fill missing values.
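A minimal pandas sketch of both cases. The DataFrame, its column names, and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data; the column names and values are invented for illustration.
df = pd.DataFrame({
    "Age": [25, 31, np.nan, 42, 38, np.nan, 95],
    "Region": ["North", "South", "North", None, "North", "East", None],
})

# Numerical column: the median is not pulled upward by the extreme value 95.
age_median = df["Age"].median()
df["Age"] = df["Age"].fillna(age_median)

# Categorical column: mode() returns a Series (ties are possible),
# so take the first entry as the fill value.
region_mode = df["Region"].mode().iloc[0]
df["Region"] = df["Region"].fillna(region_mode)

print(df)
```

If several categories tie for most frequent, `mode()` returns all of them, and picking the first one is an arbitrary choice worth reviewing.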
When to Avoid Mean/Median Imputation
While mean or median imputation is simple and quick, it isn’t always the right approach.
Avoid it when:
- A large percentage of data (e.g., >20%) is missing
- The missingness depends on other features (not random)
- You can estimate missing values more accurately using other related variables
In such cases, use model-based or feature-driven imputation, which better preserves the relationships between variables.
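The first two conditions above can be checked quickly before choosing a method. The sketch below uses a hypothetical DataFrame with invented column names; it computes the share of missing values and compares the missingness rate across groups of another feature.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "miles" is missing more often for one truck type.
df = pd.DataFrame({
    "truck_type": ["light"] * 5 + ["heavy"] * 5,
    "miles": [120, 135, np.nan, 110, 128, np.nan, np.nan, 410, np.nan, 395],
})

# Share of missing values in the column (compare against a threshold such as 20%).
missing_share = df["miles"].isna().mean()

# Missingness rate per group; large differences between groups hint that
# the values are not missing at random.
missing_by_type = df["miles"].isna().groupby(df["truck_type"]).mean()

print(f"Missing share: {missing_share:.0%}")
print(missing_by_type)
```

A gap between groups is only a hint, not proof, that missingness depends on that feature; with small groups the difference can be noise.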
Real-World Example: Manufacturing Client Case
For a manufacturing client, the goal was to build a model to predict faulty truck parts.
One critical feature was the distance covered (miles) by each truck: a higher distance meant a higher probability of part failure.
However, about 25% of the “miles” data was missing.
Filling a quarter of the column with a single median value would have distorted the original data distribution and hurt model accuracy.
Instead, a simple XGBoost model was built to predict the missing mile values based on:
- Type of truck
- Region of operation
- Engine life
- Daily usage
This approach maintained the true data pattern and produced more reliable imputations.
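The idea can be sketched as follows. This is not the client project's code: the data is synthetic, the column names and the relationship between features and miles are invented, and scikit-learn's `GradientBoostingRegressor` is used as a stand-in for XGBoost so the example runs without the `xgboost` package.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-in data; names and relationships are assumptions.
df = pd.DataFrame({
    "truck_type": rng.choice(["light", "heavy"], size=n),
    "region": rng.choice(["north", "south", "east"], size=n),
    "engine_life_years": rng.uniform(1, 10, size=n),
    "daily_usage_hours": rng.uniform(2, 12, size=n),
})
df["miles"] = (
    500 * df["engine_life_years"] * df["daily_usage_hours"]
    + np.where(df["truck_type"] == "heavy", 5000, 0)
    + rng.normal(0, 1000, size=n)
)
true_miles = df["miles"].copy()  # kept only to evaluate the imputation

# Remove roughly 25% of the values to mimic the missing data.
missing_mask = rng.random(n) < 0.25
df.loc[missing_mask, "miles"] = np.nan

# One-hot encode the categorical predictors as floats.
features = pd.get_dummies(
    df[["truck_type", "region", "engine_life_years", "daily_usage_hours"]],
    columns=["truck_type", "region"],
    dtype=float,
)

# Train on rows where miles is known, then predict the missing rows.
known = df["miles"].notna()
model = GradientBoostingRegressor(random_state=0)
model.fit(features[known], df.loc[known, "miles"])
model_pred = model.predict(features[~known])

# Compare against a plain median fill on the held-out true values.
median_fill = df.loc[known, "miles"].median()
model_mae = np.abs(model_pred - true_miles[~known]).mean()
median_mae = np.abs(median_fill - true_miles[~known]).mean()
print(f"Model MAE: {model_mae:.0f}, median MAE: {median_mae:.0f}")

# Fill only the missing rows with the model predictions.
df.loc[~known, "miles"] = model_pred
```

On real data the true values are not available, so a common way to get the same comparison is to hide a sample of known values, impute them, and measure the error on that sample.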
Key Takeaways
| Situation | Best Approach |
|---|---|
| Small percentage of missing data | Mean / Median / Mode Imputation |
| Large percentage of missing data | Model-based or feature-driven imputation |
| Data has strong feature relationships | Predict missing values using related features |
| Data contains outliers | Use Median instead of Mean |
Pro Tip:
Always visualize your data distribution before and after imputation.
If the distribution changes significantly, reconsider your imputation method.
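Alongside a histogram, a numeric summary makes the change easy to spot. The sketch below uses synthetic right-skewed data (an assumption, not data from the post) and compares summary statistics before and after median imputation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic right-skewed column with about 25% of values removed at random.
miles = pd.Series(rng.lognormal(mean=10, sigma=0.5, size=1000))
miles[rng.random(1000) < 0.25] = np.nan

imputed = miles.fillna(miles.median())

# describe() ignores NaN, so "before" summarizes only the observed values.
summary = pd.DataFrame({"before": miles.describe(), "after": imputed.describe()})
print(summary.loc[["count", "mean", "std", "25%", "50%", "75%"]])
```

The median stays where it was, but the standard deviation shrinks and the quartiles move toward the center, because a quarter of the rows now share one value. Calling `miles.hist()` and `imputed.hist()` (which requires matplotlib) would show the same effect as a spike at the median.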
What’s Next
On Day 2, we’ll discuss Correlation vs. Causation — understanding how variables relate and why correlation doesn’t always mean causation.
Follow the #StatisticsChallenge to strengthen your statistical foundation, one concept at a time.

