DEV Community

Chanchal Singh

Practical Guide to Imputing Missing Values in Data Science

Welcome to Day 1 of the Statistics Challenge for Data Scientists.

In this series, we’ll cover practical and essential statistical concepts that every data scientist should master.

Today’s topic focuses on one of the most common challenges in data preprocessing — imputing missing values.


Simplest Method to Impute Missing Values

Data Type   | Method        | When to Use
Numerical   | Mean / Median | Use mean when data is normally distributed. Use median when data has outliers.
Categorical | Mode          | Use mode (most frequent value) to fill missing categories.

Example:

For a numerical column like Age, use median if there are extreme values (outliers).

For categorical data like Region or Category, use mode to fill missing values.
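
As a minimal sketch of the example above, assuming pandas is available, the snippet below fills a numerical column with its median and a categorical column with its mode. The toy DataFrame and the `_imputed` column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: Age has one extreme value, Region is categorical.
df = pd.DataFrame({
    "Age": [25, 30, np.nan, 35, 40, 120, np.nan],
    "Region": ["North", "South", "North", None, "North", "East", None],
})

# The median is robust to the outlier (120); the mean is pulled upward by it.
age_median = df["Age"].median()
df["Age_imputed"] = df["Age"].fillna(age_median)

# mode() returns a Series because ties are possible, so take the first entry.
region_mode = df["Region"].mode().iloc[0]
df["Region_imputed"] = df["Region"].fillna(region_mode)

print(f"mean={df['Age'].mean():.1f}, median={age_median:.1f}, mode={region_mode}")
# prints: mean=50.0, median=35.0, mode=North
```

Here a single outlier moves the mean to 50 while the median stays at 35, which is why the median is the safer fill value for this column.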

[Figure: histogram comparing mean vs. median imputation]


When to Avoid Mean/Median Imputation

While mean or median imputation is simple and quick, it isn’t always the right approach.

Avoid it when:

  • A large percentage of data (e.g., >20%) is missing
  • The missingness depends on other features (not random)
  • You can estimate missing values more accurately using other related variables

In such cases, use model-based or feature-driven imputation, which preserves data integrity and relationships better.
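
The first two conditions in the list above can be checked quickly before choosing a method. This is a rough sketch with made-up data and column names (`truck_type`, `miles`). The 20% threshold is the rule of thumb from the list, not a hard limit.

```python
import numpy as np
import pandas as pd

# Hypothetical data: miles is missing more often for one truck type.
df = pd.DataFrame({
    "truck_type": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "miles": [120.0, 95.0, 110.0, np.nan, np.nan, np.nan, 300.0, np.nan],
})

# Fraction of missing values per column.
missing_share = df.isna().mean()

# Missing rate of miles within each truck type. Very different rates
# suggest the missingness depends on another feature (not random).
missing_by_type = df["miles"].isna().groupby(df["truck_type"]).mean()

THRESHOLD = 0.20  # rule-of-thumb cutoff, an assumption to tune per project
needs_better_imputation = missing_share["miles"] > THRESHOLD

print(missing_share)
print(missing_by_type)
print(needs_better_imputation)
```

In this toy data, half of `miles` is missing overall, and the missing rate is 25% for type A versus 75% for type B, so both warning signs are present.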


Real-World Example: Manufacturing Client Case

For a manufacturing client, the goal was to build a model to predict faulty truck parts.

One critical feature was the distance covered (miles) by each truck: a higher distance meant a higher probability of part failure.

However, about 25% of the “miles” data was missing.

Filling a quarter of the column with a single median value would have created an artificial spike at the median, shrunk the variance, and hurt model accuracy.

Instead, a simple XGBoost model was built to predict the missing mile values based on:

  • Type of truck
  • Region of operation
  • Engine life
  • Daily usage

This approach preserved the underlying data pattern and produced more reliable imputations.
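
The idea can be sketched on synthetic data as follows. This is not the client's pipeline: the data is randomly generated, all column names are hypothetical, and scikit-learn's `GradientBoostingRegressor` is swapped in for XGBoost so the example needs fewer dependencies. `xgboost.XGBRegressor` offers the same `fit`/`predict` interface and could be dropped in if XGBoost is installed.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-in for the truck data; all column names are hypothetical.
df = pd.DataFrame({
    "truck_type": rng.choice(["light", "heavy"], size=n),
    "region": rng.choice(["north", "south", "west"], size=n),
    "engine_life_years": rng.uniform(1, 10, size=n),
    "daily_usage_miles": rng.uniform(50, 400, size=n),
})
# Assumed relationship: miles grow with daily usage and engine life, plus noise.
df["miles"] = (
    df["daily_usage_miles"] * df["engine_life_years"] * 300
    + rng.normal(0, 5_000, size=n)
)

# Hide about 25% of the miles values to mimic the missing data.
true_miles = df["miles"].copy()
df.loc[rng.random(n) < 0.25, "miles"] = np.nan

# One-hot encode the categorical predictors.
X = pd.get_dummies(
    df.drop(columns="miles"), columns=["truck_type", "region"], dtype=float
)
known = df["miles"].notna()

# Fit on rows where miles is known, then predict the missing rows.
model = GradientBoostingRegressor(random_state=0)
model.fit(X[known], df.loc[known, "miles"])
df.loc[~known, "miles"] = model.predict(X[~known])

# Compare both methods against the values that were hidden.
median_fill = true_miles[known].median()
mae_model = (df.loc[~known, "miles"] - true_miles[~known]).abs().mean()
mae_median = (median_fill - true_miles[~known]).abs().mean()
print(f"MAE model: {mae_model:,.0f}  MAE median: {mae_median:,.0f}")
```

Because the hidden values are known in this synthetic setup, the error of each method can be measured directly. On real data the same check can be approximated by masking a sample of known values and scoring the imputations against them.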


Key Takeaways

Situation                             | Best Approach
Small percentage of missing data      | Mean / Median / Mode Imputation
Large percentage of missing data      | Model-based or feature-driven imputation
Data has strong feature relationships | Predict missing values using related features
Data contains outliers                | Use Median instead of Mean

Pro Tip:

Always visualize your data distribution before and after imputation.

If the distribution changes significantly, reconsider your imputation method.
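
One possible way to run this check numerically, alongside a histogram, is to compare summary statistics before and after imputation. The sketch below uses a made-up, right-skewed feature with about 25% of values removed at random. The data and the choice of statistics are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical right-skewed feature with about 25% of values missing at random.
original = pd.Series(rng.lognormal(mean=10, sigma=0.5, size=1000))
observed = original.copy()
observed[rng.random(1000) < 0.25] = np.nan

imputed = observed.fillna(observed.median())

# Side-by-side summary statistics before and after imputation.
# describe() ignores missing values, so "before" covers observed rows only.
summary = pd.DataFrame({
    "before": observed.describe(),
    "after": imputed.describe(),
})
print(summary.round(1))

# Median imputation leaves the median unchanged but shrinks the spread.
std_ratio = imputed.std() / observed.std()
print(f"std after / std before = {std_ratio:.2f}")

# For a visual check (assuming matplotlib is installed):
# observed.plot.hist(bins=40, alpha=0.5); imputed.plot.hist(bins=40, alpha=0.5)
```

A standard deviation ratio well below 1 is the numerical counterpart of the spike a histogram shows at the imputed value.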

[Figure: effect of imputation on the data distribution]


What’s Next

On Day 2, we’ll discuss Correlation vs. Causation — understanding how variables relate and why correlation doesn’t always mean causation.

Follow the #StatisticsChallenge to strengthen your statistical foundation, one concept at a time.
