Practical Guide to Imputing Missing Values in Data Science

#statistics #datascience #ai #programming

Welcome to Day 1 of the Statistics Challenge for Data Scientists.

In this series, we’ll cover practical and essential statistical concepts that every data scientist should master.

Today’s topic focuses on one of the most common challenges in data preprocessing — imputing missing values.

Simplest Method to Impute Missing Values

Data Type	Method	When to Use
Numerical	Mean / Median	Use the mean when the data is roughly symmetric (for example, approximately normal). Use the median when the data is skewed or has outliers.
Categorical	Mode	Use mode (most frequent value) to fill missing categories.

Example:

For a numerical column like Age, use median if there are extreme values (outliers).

For categorical data like Region or Category, use mode to fill missing values.
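Here is a minimal sketch of both rules using pandas. The DataFrame, its values, and the column names `Age` and `Region` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data; the column names and values are invented for this example.
df = pd.DataFrame({
    "Age": [22, 25, np.nan, 31, 29, 95, np.nan],
    "Region": ["North", "South", "North", None, "North", "East", "South"],
})

# The median is robust to the extreme value (95); the mean would be pulled upward.
age_median = df["Age"].median()
df["Age"] = df["Age"].fillna(age_median)

# mode() returns a Series because ties are possible; take the first entry.
region_mode = df["Region"].mode().iloc[0]
df["Region"] = df["Region"].fillna(region_mode)

print(df)
```

Compute the fill values on the training split only and reuse them on validation and test data, so no information leaks across splits.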

When to Avoid Mean/Median Imputation

While mean or median imputation is simple and quick, it isn’t always the right approach.

Avoid it when:

A large percentage of data (e.g., >20%) is missing
The missingness depends on other features (values are not missing completely at random)
You can estimate missing values more accurately using other related variables

In such cases, use model-based or feature-driven imputation, which preserves data integrity and relationships better.
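The first two checks in the list above can be sketched quickly with pandas. The data and the column names `truck_type` and `miles` are hypothetical, and the 20% threshold is only the rule of thumb mentioned above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "miles" is missing far more often for one truck type.
df = pd.DataFrame({
    "truck_type": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "miles": [120.0, 150.0, 130.0, np.nan, np.nan, np.nan, 400.0, np.nan],
})

# Share of missing values in the column.
missing_share = df["miles"].isna().mean()

# Missing rate per group: a large gap between groups suggests the
# missingness is related to that feature rather than random.
missing_by_type = df["miles"].isna().groupby(df["truck_type"]).mean()

print(f"Missing share: {missing_share:.0%}")
print(missing_by_type)
```

If the missing share is high or the per-group rates differ a lot, simple mean or median filling is likely a poor fit.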

Real-World Example: Manufacturing Client Case

For a manufacturing client, the task was to build a model to predict faulty truck parts.

One critical feature was the distance covered (miles) by each truck: a higher distance meant a higher probability of part failure.

However, about 25% of the “miles” data was missing.

Using median imputation would have distorted the original data distribution and affected model accuracy.

Instead, a simple XGBoost model was built to predict the missing miles values based on:

Type of truck
Region of operation
Engine life
Daily usage

This approach preserved the underlying data pattern and produced more reliable imputations.
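The general idea can be sketched as follows. This is not the client's code or data: the dataset is synthetic, the column names and relationships are invented, and scikit-learn's `GradientBoostingRegressor` is swapped in for XGBoost to keep the example light on dependencies. An `xgboost.XGBRegressor` could be used in the same fit/predict pattern.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-in data; column names and relationships are invented.
df = pd.DataFrame({
    "truck_type": rng.choice(["light", "heavy"], size=n),
    "region": rng.choice(["north", "south", "west"], size=n),
    "engine_life_years": rng.uniform(1, 10, size=n),
    "daily_usage_hours": rng.uniform(2, 12, size=n),
})
df["miles"] = (
    3000 * df["engine_life_years"] * df["daily_usage_hours"]
    + np.where(df["truck_type"] == "heavy", 20000, 0)
    + rng.normal(0, 5000, size=n)
)

# Hide about 25% of the target to mimic the missing values.
true_miles = df["miles"].copy()
hidden = rng.random(n) < 0.25
df.loc[hidden, "miles"] = np.nan

# One-hot encode the categorical predictors as floats.
X = pd.get_dummies(
    df.drop(columns="miles"), columns=["truck_type", "region"], dtype=float
)
known = df["miles"].notna()

# Train on rows where miles is known, then predict the missing rows.
model = GradientBoostingRegressor(random_state=0)
model.fit(X[known], df.loc[known, "miles"])
df.loc[~known, "miles"] = model.predict(X[~known])

# Compare against median imputation on the hidden values.
median_fill = true_miles[known].median()
mae_model = (df.loc[~known, "miles"] - true_miles[~known]).abs().mean()
mae_median = (median_fill - true_miles[~known]).abs().mean()

print(f"MAE, model-based imputation: {mae_model:,.0f}")
print(f"MAE, median imputation:      {mae_median:,.0f}")
```

Hiding known values and scoring the imputations against them, as done here, is a simple way to check whether the model-based approach actually beats the median on your own data.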

Key Takeaways

Situation	Best Approach
Small percentage of missing data	Mean / Median / Mode Imputation
Large percentage of missing data	Model-based or feature-driven imputation
Data has strong feature relationships	Predict missing values using related features
Data contains outliers	Use Median instead of Mean

Pro Tip:

Always visualize your data distribution before and after imputation.

If the distribution changes significantly, reconsider your imputation method.
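As a numeric companion to the visual check, here is a small sketch that compares summary statistics before and after median imputation. The data is synthetic and the 30% missing rate is an assumption chosen to make the effect visible.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic right-skewed column with about 30% of values removed at random.
original = pd.Series(rng.lognormal(mean=3.0, sigma=0.5, size=1000))
with_gaps = original.mask(rng.random(1000) < 0.30)

imputed = with_gaps.fillna(with_gaps.median())

# Side-by-side summary of the observed values and the imputed column.
summary = pd.DataFrame({
    "before": with_gaps.dropna().describe(),
    "after": imputed.describe(),
})

# Filling many gaps with one constant shrinks the spread.
std_ratio = imputed.std() / with_gaps.std()

print(summary)
print(f"Std after / std before: {std_ratio:.2f}")
```

The same two series, `with_gaps.dropna()` and `imputed`, can be passed to any histogram or density plot to see the spike that constant filling creates at the median.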