Sachin Kr. Rajput

Handling Missing Data: The Detective's Guide to Solving the Case of the Vanishing Values

The One-Line Summary: Missing data isn't a death sentence. You can delete it, fill it in, predict it, or flag it. The right choice depends on WHY it's missing and HOW MUCH is gone.


The Case of the Incomplete Census

Detective Rivera stared at the census forms spread across her desk.

The mayor needed population statistics by Monday. But something was wrong.

Half the forms were incomplete.

Some people left their income blank. Some skipped the age question. A few suspicious souls left entire pages empty.

Rivera had a decision to make.


Option 1: Throw Out the Incomplete Forms

"If they didn't fill it out completely, they don't count."

She could toss every incomplete form in the trash. Only use the perfect ones.

Problem: She'd lose half her data. The remaining sample might be biased. Maybe only wealthy people filled out the income field. Maybe only young people completed everything.

The statistics would be cleaner, but possibly wrong.


Option 2: Guess the Missing Values

"Income is blank? They probably earn the average. Age is blank? Let's say 35."

She could fill in the blanks with reasonable guesses.

Problem: Her guesses might be systematically wrong. If poor people skip the income question, filling in the average overstates their income.

The data would be complete, but possibly misleading.


Option 3: Investigate WHY It's Missing

"Why did they leave this blank? What does the absence tell us?"

She could treat the missing value as information itself. Maybe people who skip the income question ARE different — and that difference matters.

This is the smart approach.


Detective Rivera's dilemma is YOUR dilemma every time you open a dataset.

Missing data is everywhere. How you handle it determines whether your analysis is brilliant or broken.

Let me show you every tool in the detective's toolkit.


Why Is Data Missing?

Before you fix missing data, you need to understand WHY it's missing.

This isn't philosophical — it's practical. The reason determines the solution.

Type 1: Missing Completely at Random (MCAR)

What it means: The missingness has NOTHING to do with any data.

Example: A survey server crashed randomly, losing some responses. The lost responses aren't different from the kept ones — it was pure bad luck.

Good news: You can usually just delete these rows or fill them in. The remaining data is still representative.

Data lost because:
✓ Server crash
✓ Coffee spilled on paper forms
✓ Random transmission error

No pattern. Just chaos.

Type 2: Missing at Random (MAR)

What it means: The missingness is related to OTHER variables you can see, but not to the missing value itself.

Example: Young people skip the income question more often. The missingness relates to age (which you have), not to income itself.

Strategy: If you know the pattern, you can account for it. Use age to help predict the missing income.

Pattern you can SEE:
Age 18-25 → 40% skip income
Age 26-40 → 20% skip income
Age 40+   → 10% skip income

You can use age to help fill in the blanks.
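If you want to check whether your own data shows this kind of pattern, a quick group-by on a missingness flag is usually enough. A minimal sketch, assuming a DataFrame df with age and income columns (hypothetical names):

import pandas as pd

# Flag which rows are missing income, then see how that flag
# varies across age bands -- a visible pattern suggests MAR
df['income_missing'] = df['income'].isna()

age_bands = pd.cut(df['age'], bins=[17, 25, 40, 120],
                   labels=['18-25', '26-40', '40+'])

# Share of missing income within each age band
print(df.groupby(age_bands, observed=True)['income_missing'].mean())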

Type 3: Missing Not at Random (MNAR)

What it means: The missingness is related to THE MISSING VALUE ITSELF.

Example: People with low income skip the income question because they're embarrassed. The very thing you're trying to measure causes the missingness.

Bad news: This is the hardest case. You can't fully recover the information because you don't know what you don't know.

Pattern you CAN'T see:
Low income → Skip income question
High income → Answer honestly

The missingness IS the information.
Filling in the average would be WRONG.

The Arsenal: Every Way to Handle Missing Data

Now let's get practical. Here are your options, from simple to sophisticated.


Method 1: Delete the Rows (Listwise Deletion)

The approach: Any row with a missing value? Gone.

import pandas as pd

# Before: 1000 rows, some with missing values
print(f"Before: {len(df)} rows")

# After: Only complete rows
df_clean = df.dropna()
print(f"After: {len(df_clean)} rows")

Visual:

Original:
[Name] [Age] [Income] [City]
[John]  [32]  [50000]  [NYC]     ✓ Keep
[Mary]  [28]  [  ?  ]  [LA]      ✗ Delete (missing income)
[Bob]   [45]  [72000]  [  ?  ]   ✗ Delete (missing city)
[Sara]  [36]  [61000]  [Chicago] ✓ Keep

Result: Lost 50% of data!

When to Use It

✅ Data is MCAR (missing completely at random)
✅ Very few rows have missing values (< 5%)
✅ You have LOTS of data to spare
✅ Quick and dirty analysis

When to Avoid It

❌ Lots of missing data (you'll lose too much)
❌ Data is MAR or MNAR (you'll introduce bias)
❌ Missing values are informative
❌ Every row matters
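Before committing to deletion, it's worth counting what you would actually lose. A quick sketch, assuming df is your DataFrame:

# How many rows survive listwise deletion?
rows_total = len(df)
rows_complete = len(df.dropna())
print(f"Complete rows: {rows_complete} / {rows_total} ({rows_complete / rows_total:.0%})")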


Method 2: Delete the Columns

The approach: If a column has too many missing values, drop the entire column.

# Drop columns with more than 50% missing
# (thresh = the minimum number of non-NA values a column needs to be kept)
threshold = int(len(df) * 0.5)
df_clean = df.dropna(axis=1, thresh=threshold)

print(f"Columns before: {df.shape[1]}")
print(f"Columns after: {df_clean.shape[1]}")

Visual:

Original:
[Name] [Age] [Income] [FavoriteColor]
[John]  [32]  [50000]      [?]
[Mary]  [28]  [  ?  ]      [?]
[Bob]   [45]  [72000]      [?]
[Sara]  [36]  [61000]      [Blue]

FavoriteColor: 75% missing → DROP THE COLUMN

Result: 
[Name] [Age] [Income]
  ✓ No missing values in remaining columns!

When to Use It

✅ A column has massive missingness (> 50-70%)
✅ The column isn't critical to your analysis
✅ You have other columns with similar information

When to Avoid It

❌ The column is essential
❌ The missingness itself is informative
❌ You only have a few columns to begin with
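A per-column audit makes this call easier. A small sketch, assuming df is your DataFrame and a 60% cutoff (the exact threshold is a judgment call):

# Fraction of missing values per column, worst first
missing_share = df.isnull().mean()
print(missing_share.sort_values(ascending=False))

# Keep only the columns below the cutoff
cols_to_keep = missing_share[missing_share <= 0.6].index
df_clean = df[cols_to_keep]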


Method 3: Fill with a Constant (Simple Imputation)

The approach: Replace missing values with a single value — mean, median, mode, or a custom constant.

# Fill with mean
df['income'] = df['income'].fillna(df['income'].mean())

# Fill with median (better for skewed data)
df['income'] = df['income'].fillna(df['income'].median())

# Fill with mode (for categorical)
df['city'] = df['city'].fillna(df['city'].mode()[0])

# Fill with a custom value
df['income'] = df['income'].fillna(0)  # Or -1 as a flag

Visual:

Before:
Income: [50000, ?, 72000, ?, 61000, 55000]

Mean = 59,500

After (mean imputation):
Income: [50000, 59500, 72000, 59500, 61000, 55000]
               ↑              ↑
         Filled with mean

Mean vs Median vs Mode

Method | Use When
Mean   | Data is roughly symmetric
Median | Data is skewed (outliers present)
Mode   | Categorical data
import numpy as np

# Example: Salary data with a CEO outlier
salaries = [50000, 55000, 48000, 52000, 5000000]  # CEO!

print(np.mean(salaries))    # 1,041,000 - pulled up by the CEO
print(np.median(salaries))  # 52,000    - unaffected by the CEO

# Median is the better fill value here!

The Danger of Mean/Median Imputation

It underestimates variance.

Original variance:  ████████████████████
After mean impute:  ████████████
                         ↑
              You've squished the spread!

By filling with the average, you're making your data artificially "clumpy" around the center. This can break statistical tests and bias predictions.
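You can watch the shrinkage happen with a few made-up numbers. A minimal sketch:

import numpy as np
import pandas as pd

s = pd.Series([50, 60, np.nan, 80, np.nan, 100, 40])
filled = s.fillna(s.mean())

print(f"Std before imputation: {s.std():.1f}")       # NaNs are simply ignored
print(f"Std after mean fill:   {filled.std():.1f}")  # noticeably smaller spread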


Method 4: Fill Forward / Backward (Time Series)

The approach: For time-ordered data, use the previous (or next) value.

# Forward fill: use the last known value
df['stock_price'] = df['stock_price'].ffill()

# Backward fill: use the next known value
df['stock_price'] = df['stock_price'].bfill()

Visual:

Time series:
Day:   1      2      3      4      5      6
Price: 100    102    ?      ?      108    110

Forward fill (ffill):
Price: 100    102    102    102    108    110
                     ↑      ↑
              Carried forward from day 2

Backward fill (bfill):
Price: 100    102    108    108    108    110
                     ↑      ↑
              Pulled back from day 5

When to Use It

✅ Time series data
✅ Values change slowly over time
✅ The last known value is a reasonable estimate

When to Avoid It

❌ Data isn't time-ordered
❌ Values can change dramatically between observations
❌ Long gaps in the data
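One practical guard against the long-gap problem is to cap how far a value is allowed to propagate. A small sketch, assuming a stock_price column:

# Carry the last observation forward, but never across more than 2 consecutive gaps
df['stock_price'] = df['stock_price'].ffill(limit=2)

# Anything beyond the limit stays NaN, so long outages remain visible
print(df['stock_price'].isna().sum())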


Method 5: Interpolation (Time Series)

The approach: Estimate missing values based on surrounding values.

# Linear interpolation
df['temperature'] = df['temperature'].interpolate(method='linear')

# Polynomial interpolation (smoother; needs SciPy installed)
df['temperature'] = df['temperature'].interpolate(method='polynomial', order=2)

# Time-based interpolation (needs a DatetimeIndex)
df['temperature'] = df['temperature'].interpolate(method='time')

Visual:

Day:    1      2      3      4      5
Temp:   60     64     ?      ?      76

Linear interpolation:
Temp:   60     64     68     72     76
                      ↑      ↑
           Evenly spaced between 64 and 76

Types of Interpolation

Method     | Shape                   | Use When
Linear     | Straight line           | Simple trends
Polynomial | Curved                  | Smooth, curved patterns
Spline     | Very smooth curve       | Complex patterns
Time-based | Accounts for timestamps | Irregular time intervals
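One detail worth knowing: method='time' needs a datetime index, and the polynomial/spline methods need SciPy installed. A short sketch with made-up timestamps:

import numpy as np
import pandas as pd

# Time-based interpolation weights by the actual gap between timestamps,
# so it requires a DatetimeIndex
temps = pd.Series(
    [60.0, 64.0, np.nan, 76.0],
    index=pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-04', '2024-01-06'])
)
print(temps.interpolate(method='time'))  # fills the gap at Jan 4 with 70.0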

Method 6: K-Nearest Neighbors Imputation

The approach: Find similar rows (neighbors) and use their values.

from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=5)
df_imputed = pd.DataFrame(
    imputer.fit_transform(df),
    columns=df.columns
)

Visual:

Target row: [Age=30, Income=?, City=NYC, Education=Bachelor's]

Find 5 most similar rows:
1. [Age=31, Income=52000, City=NYC, Education=Bachelor's]
2. [Age=29, Income=48000, City=NYC, Education=Bachelor's]
3. [Age=32, Income=55000, City=NYC, Education=Master's]
4. [Age=28, Income=51000, City=Boston, Education=Bachelor's]
5. [Age=30, Income=49000, City=NYC, Education=Bachelor's]

Average their incomes: (52000+48000+55000+51000+49000)/5 = 51,000

Fill: [Age=30, Income=51000, City=NYC, Education=Bachelor's]

Why It's Smart

The imputed value is based on similar people, not everyone. A 30-year-old in NYC with a bachelor's degree probably earns like other 30-year-olds in NYC with bachelor's degrees — not like the overall average.

When to Use It

✅ Data is MAR (pattern depends on other variables)
✅ You have good features to define "similarity"
✅ Moderate amount of missing data

When to Avoid It

❌ High dimensionality (similarity becomes meaningless)
❌ Data is MNAR
❌ Computationally expensive for large datasets
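One caveat worth adding: KNN distances are dominated by whichever feature has the biggest numbers, so income (tens of thousands) drowns out age. A hedged sketch that scales first, imputes, then converts back, assuming numeric columns named age and income:

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.impute import KNNImputer

numeric_cols = ['age', 'income']  # assumed column names

# StandardScaler ignores NaNs when fitting and passes them through,
# so scaling before imputation is safe
scaler = StandardScaler()
scaled = scaler.fit_transform(df[numeric_cols])

imputer = KNNImputer(n_neighbors=5)
imputed_scaled = imputer.fit_transform(scaled)

# Back to the original units
df[numeric_cols] = scaler.inverse_transform(imputed_scaled)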


Method 7: Model-Based Imputation

The approach: Train a model to PREDICT the missing values.

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# Use Random Forest to predict missing values
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=10, random_state=42),
    max_iter=10,
    random_state=42
)

df_imputed = pd.DataFrame(
    imputer.fit_transform(df),
    columns=df.columns
)

How It Works:

Step 1: For each column with missing values
Step 2: Treat it as the TARGET
Step 3: Use other columns as FEATURES
Step 4: Train a model on complete rows
Step 5: Predict missing values
Step 6: Repeat until convergence

Visual:

Predicting missing income:

Features: [Age, Education, City, JobTitle]
Target:   [Income]

Train model on rows WITH income
Predict income for rows WITHOUT income

Income_missing = Model(Age, Education, City, JobTitle)

When to Use It

✅ Strong relationships between variables
✅ Moderate amount of missing data
✅ You need high accuracy

When to Avoid It

❌ Variables are independent
❌ Massive missingness
❌ Risk of data leakage in ML pipelines


Method 8: Create a "Missing" Indicator

The approach: Don't fill it. FLAG it.

# Create indicator column
df['income_missing'] = df['income'].isna().astype(int)

# Then fill the original (with mean, median, etc.)
df['income'] = df['income'].fillna(df['income'].median())

Visual:

Before:
[Name] [Income]
[John]  [50000]
[Mary]  [  ?  ]
[Bob]   [72000]

After:
[Name] [Income] [Income_Missing]
[John]  [50000]       [0]
[Mary]  [61000]       [1]    ← Filled + Flagged
[Bob]   [72000]       [0]

Why This Is Brilliant

Sometimes the fact that data is missing IS the information.

  • People who skip income might be low earners (embarrassed)
  • Missing medical tests might indicate the patient was too sick
  • Empty survey fields might indicate disengagement

By flagging, your model can LEARN from the missingness pattern.

When to Use It

✅ Suspecting MNAR (missingness is informative)
✅ Building ML models
✅ Want to preserve the "missing" signal
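If you're using scikit-learn, you don't even have to build the flags by hand: SimpleImputer can append them for you via add_indicator=True. A minimal sketch, assuming numeric age and income columns:

from sklearn.impute import SimpleImputer

# add_indicator=True appends one binary column per feature
# that had missing values during fit
imputer = SimpleImputer(strategy='median', add_indicator=True)
result = imputer.fit_transform(df[['age', 'income']])  # assumed column names

print(result.shape)  # original features plus the indicator columns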


Method 9: Multiple Imputation

The approach: Don't create ONE filled dataset. Create MANY.

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
import numpy as np

# Create multiple imputed datasets
n_imputations = 5
imputed_datasets = []

for i in range(n_imputations):
    imputer = IterativeImputer(random_state=i, sample_posterior=True)
    imputed = imputer.fit_transform(df)
    imputed_datasets.append(imputed)

# Analyze each, then combine results
results = []
for dataset in imputed_datasets:
    # Run your analysis on each
    result = your_analysis(dataset)
    results.append(result)

# Pool the point estimates (full Rubin's rules also combine the variances)
final_estimate = np.mean(results)

Visual:

Original dataset (with missing):
     [████  ?  ████  ?  ████]

Create 5 different filled versions:
     [████ 52 ████ 47 ████]  → Analysis 1 → Result 1
     [████ 48 ████ 51 ████]  → Analysis 2 → Result 2
     [████ 55 ████ 49 ████]  → Analysis 3 → Result 3
     [████ 50 ████ 53 ████]  → Analysis 4 → Result 4
     [████ 51 ████ 48 ████]  → Analysis 5 → Result 5

Combine: Average the 5 results

Why This Is the Gold Standard

Single imputation pretends you KNOW the missing value. You don't.

Multiple imputation acknowledges the UNCERTAINTY. By creating multiple plausible datasets, you capture the range of possibilities.

When to Use It

✅ Statistical research requiring valid inference
✅ Uncertainty quantification matters
✅ You need to report confidence intervals
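For completeness, here is what pooling with Rubin's rules actually looks like: combine the point estimates, then combine the within- and between-imputation variances. A rough sketch with hypothetical numbers:

import numpy as np

# Hypothetical estimates and squared standard errors from 5 imputed datasets
estimates = np.array([2.1, 1.9, 2.3, 2.0, 2.2])
variances = np.array([0.25, 0.22, 0.27, 0.24, 0.26])
m = len(estimates)

pooled_estimate = estimates.mean()
within_var = variances.mean()                       # average uncertainty inside each dataset
between_var = estimates.var(ddof=1)                 # disagreement across datasets
total_var = within_var + (1 + 1 / m) * between_var  # Rubin's total variance

print(f"Pooled estimate: {pooled_estimate:.2f}, pooled SE: {np.sqrt(total_var):.2f}")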


The Decision Flowchart

Not sure which method to use? Follow this:

START
  │
  ▼
How much data is missing?
  │
  ├─ < 5% ───────────────────────────────────────┐
  │                                              │
  ├─ 5-20% ──────────────────────────────┐       │
  │                                      │       │
  └─ > 20% ──────────────────┐           │       │
                             │           │       │
                             ▼           ▼       ▼
                     Consider if    What type    Simple
                     column is      of missing?  deletion
                     worth keeping      │        usually OK
                             │          │            │
                             ▼          │            │
                        Maybe drop      │            │
                        the column      │            │
                                        ▼            │
                              ┌─────────┴─────────┐  │
                              │                   │  │
                           MCAR?               MAR/MNAR?
                              │                   │
                              ▼                   ▼
                         Mean/Median        KNN, Model-based,
                         imputation OK      or Indicator flag

Complete Code Example

Let's handle missing data in a real scenario:

import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer

# Create sample data with missing values
np.random.seed(42)
df = pd.DataFrame({
    'age': [25, 30, np.nan, 45, 50, 35, np.nan, 40, 28, 55],
    'income': [50000, np.nan, 65000, np.nan, 90000, 55000, 70000, np.nan, 48000, 95000],
    'education': ['Bachelor', 'Master', 'Bachelor', np.nan, 'PhD', 'Bachelor', 'Master', 'Bachelor', np.nan, 'PhD'],
    'city': ['NYC', 'LA', np.nan, 'Chicago', 'NYC', 'LA', 'NYC', np.nan, 'Chicago', 'LA']
})

print("=== Original Data ===")
print(df)
print(f"\nMissing values:\n{df.isnull().sum()}")

# Strategy 1: Simple deletion
df_deleted = df.dropna()
print(f"\n=== After Deletion ===")
print(f"Rows remaining: {len(df_deleted)} / {len(df)}")

# Strategy 2: Median for numeric
df_simple = df.copy()
df_simple['age'] = df_simple['age'].fillna(df_simple['age'].median())
df_simple['income'] = df_simple['income'].fillna(df_simple['income'].median())

# Strategy 3: Mode for categorical
df_simple['education'] = df_simple['education'].fillna(df_simple['education'].mode()[0])
df_simple['city'] = df_simple['city'].fillna(df_simple['city'].mode()[0])

print(f"\n=== After Simple Imputation ===")
print(df_simple)

# Strategy 4: KNN for numeric columns
df_knn = df.copy()
numeric_cols = ['age', 'income']
knn_imputer = KNNImputer(n_neighbors=3)
df_knn[numeric_cols] = knn_imputer.fit_transform(df_knn[numeric_cols])

print(f"\n=== After KNN Imputation (numeric only) ===")
print(df_knn[numeric_cols].round(0))

# Strategy 5: Add missing indicators
df_flagged = df.copy()
for col in df.columns:
    if df[col].isnull().any():
        df_flagged[f'{col}_missing'] = df[col].isnull().astype(int)

print(f"\n=== With Missing Indicators ===")
print(df_flagged.columns.tolist())

Output:

=== Original Data ===
    age   income education     city
0  25.0  50000.0  Bachelor      NYC
1  30.0      NaN    Master       LA
2   NaN  65000.0  Bachelor      NaN
3  45.0      NaN       NaN  Chicago
4  50.0  90000.0       PhD      NYC
5  35.0  55000.0  Bachelor       LA
6   NaN  70000.0    Master      NYC
7  40.0      NaN  Bachelor      NaN
8  28.0  48000.0       NaN  Chicago
9  55.0  95000.0       PhD       LA

Missing values:
age          2
income       3
education    2
city         2

=== After Deletion ===
Rows remaining: 4 / 10

=== After Simple Imputation ===
    age   income education     city
0  25.0  50000.0  Bachelor      NYC
1  30.0  65000.0    Master       LA
2  37.5  65000.0  Bachelor       LA
3  45.0  65000.0  Bachelor  Chicago
4  50.0  90000.0       PhD      NYC
5  35.0  55000.0  Bachelor       LA
6  37.5  70000.0    Master      NYC
7  40.0  65000.0  Bachelor       LA
8  28.0  48000.0  Bachelor  Chicago
9  55.0  95000.0       PhD       LA

=== After KNN Imputation (numeric only) ===
    age    income
0  25.0   50000.0
1  30.0   51000.0
2  32.0   65000.0
3  45.0   78333.0
4  50.0   90000.0
5  35.0   55000.0
6  32.0   70000.0
7  40.0   58333.0
8  28.0   48000.0
9  55.0   95000.0

=== With Missing Indicators ===
['age', 'income', 'education', 'city', 'age_missing', 'income_missing', 'education_missing', 'city_missing']

Common Mistakes

Mistake 1: Imputing Before Train-Test Split

# WRONG: Impute on full dataset, then split
imputer.fit_transform(df)  # Uses test data statistics!
X_train, X_test = train_test_split(df)

# RIGHT: Split first, impute on train, transform test
X_train, X_test = train_test_split(df)
imputer.fit(X_train)
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)  # Uses TRAIN statistics
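An even safer pattern is to put the imputer inside a scikit-learn Pipeline, so every cross-validation fold refits it on that fold's training data automatically. A sketch, assuming X holds numeric features and y the labels:

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# The imputer is refit on the training portion of every fold,
# so test-fold statistics never leak into the fill values
pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('model', LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, X, y, cv=5)  # X, y assumed to exist
print(scores.mean())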

Mistake 2: Ignoring the Missingness Pattern

# WRONG: Just fill everything with the mean, no questions asked
df.fillna(df.mean(numeric_only=True))

# RIGHT: Investigate first
print(df.isnull().sum())
print(df[df['income'].isnull()]['education'].value_counts())  # Pattern?

Mistake 3: Using Mean for Skewed Data

# WRONG: Mean with outliers
salaries = [50000, 55000, 48000, 5000000]  # CEO!
mean_salary = np.mean(salaries)  # 1,288,250 - way too high!

# RIGHT: Median for skewed data
median_salary = np.median(salaries)  # 52,500 - reasonable

Mistake 4: Forgetting Categorical Variables

# WRONG: KNN imputer on categorical columns
knn.fit_transform(df)  # Breaks on 'city', 'education'

# RIGHT: Handle categorical separately
numeric_cols = df.select_dtypes(include=[np.number]).columns
categorical_cols = df.select_dtypes(include=['object']).columns

# Impute numeric with KNN
df[numeric_cols] = knn.fit_transform(df[numeric_cols])

# Impute categorical with mode
for col in categorical_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

The Cheat Sheet

Method              | When to Use                   | Pros                  | Cons
Delete rows         | < 5% missing, MCAR            | Simple, no bias       | Lose data
Delete columns      | > 50% missing                 | Removes noise         | Lose features
Mean/Median         | Quick analysis, MCAR          | Fast, easy            | Underestimates variance
Mode                | Categorical data              | Simple                | Ignores patterns
Forward/Back fill   | Time series                   | Preserves trends      | Can propagate errors
Interpolation       | Time series, smooth data      | Natural estimates     | Assumes continuity
KNN                 | MAR, good features            | Uses similarity       | Slow, curse of dimensionality
Model-based         | Strong variable relationships | Most accurate         | Complex, risk of leakage
Missing indicator   | MNAR, ML models               | Preserves information | Adds columns
Multiple imputation | Research, uncertainty         | Gold standard         | Complex, slow

Key Takeaways

  1. Understand WHY data is missing — MCAR, MAR, MNAR require different approaches

  2. Deletion is often the worst choice — You lose data AND might introduce bias

  3. Mean imputation is quick but dangerous — It squishes variance

  4. KNN imputation uses similar rows — Great when patterns exist

  5. Missing indicators preserve information — The absence IS data

  6. Impute AFTER train-test split — Avoid data leakage

  7. Different columns need different strategies — Numeric vs categorical

  8. Multiple imputation is the gold standard — For research and uncertainty


The One-Sentence Summary

Missing data isn't missing information — it's hidden information. Your job is to decode it, not delete it.


What's Next?

Now that you understand missing data handling, you're ready for:

  • Outlier Detection and Treatment — The other data cleaning challenge
  • Feature Encoding — Handling categorical variables
  • Data Normalization — Scaling features for models
  • Imbalanced Data — When classes aren't equal

Follow me for the next article in this series!


Let's Connect!

If this helped you handle missing data better, drop a heart!

Questions? Ask in the comments — I read and respond to every one.

What's your go-to imputation method? I'm curious!


The difference between a dataset that trains a great model and one that trains garbage? Often just how you handled the missing values. Don't delete — investigate.


Share this with someone who's been dropping rows with NaN without thinking. Save their data. Save their model.

Happy cleaning!
