Sachin Kr. Rajput

Handling Missing Data: The Detective's Guide to Solving the Case of the Vanishing Values

The One-Line Summary: Missing data isn't a death sentence. You can delete it, fill it in, predict it, or flag it. The right choice depends on WHY it's missing and HOW MUCH is gone.


The Case of the Incomplete Census

Detective Rivera stared at the census forms spread across her desk.

The mayor needed population statistics by Monday. But something was wrong.

Half the forms were incomplete.

Some people left their income blank. Some skipped the age question. A few suspicious souls left entire pages empty.

Rivera had a decision to make.


Option 1: Throw Out the Incomplete Forms

"If they didn't fill it out completely, they don't count."

She could toss every incomplete form in the trash. Only use the perfect ones.

Problem: She'd lose half her data. The remaining sample might be biased. Maybe only wealthy people filled out the income field. Maybe only young people completed everything.

The statistics would be cleaner, but possibly wrong.


Option 2: Guess the Missing Values

"Income is blank? They probably earn the average. Age is blank? Let's say 35."

She could fill in the blanks with reasonable guesses.

Problem: Her guesses might be systematically wrong. If poor people skip the income question, filling in the average overstates their income.

The data would be complete, but possibly misleading.


Option 3: Investigate WHY It's Missing

"Why did they leave this blank? What does the absence tell us?"

She could treat the missing value as information itself. Maybe people who skip the income question ARE different — and that difference matters.

This is the smart approach.


Detective Rivera's dilemma is YOUR dilemma every time you open a dataset.

Missing data is everywhere. How you handle it determines whether your analysis is brilliant or broken.

Let me show you every tool in the detective's toolkit.


Why Is Data Missing?

Before you fix missing data, you need to understand WHY it's missing.

This isn't philosophical — it's practical. The reason determines the solution.

Type 1: Missing Completely at Random (MCAR)

What it means: The missingness has NOTHING to do with any data.

Example: A survey server crashed randomly, losing some responses. The lost responses aren't different from the kept ones — it was pure bad luck.

Good news: You can usually just delete these rows or fill them in. The remaining data is still representative.

Data lost because:
✓ Server crash
✓ Coffee spilled on paper forms
✓ Random transmission error

No pattern. Just chaos.

Type 2: Missing at Random (MAR)

What it means: The missingness is related to OTHER variables you can see, but not to the missing value itself.

Example: Young people skip the income question more often. The missingness relates to age (which you have), not to income itself.

Strategy: If you know the pattern, you can account for it. Use age to help predict the missing income.

Pattern you can SEE:
Age 18-25 → 40% skip income
Age 26-40 → 20% skip income
Age 40+   → 10% skip income

You can use age to help fill in the blanks.
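If you want to check whether your own data shows this kind of pattern, a quick group-by on a missingness flag is usually enough. A minimal sketch, assuming a DataFrame df with age and income columns (hypothetical names):

import pandas as pd

# Flag which rows are missing income, then see how that flag
# varies across age bands -- a visible pattern suggests MAR
df['income_missing'] = df['income'].isna()

age_bands = pd.cut(df['age'], bins=[17, 25, 40, 120],
                   labels=['18-25', '26-40', '40+'])

# Share of missing income within each age band
print(df.groupby(age_bands, observed=True)['income_missing'].mean())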

Type 3: Missing Not at Random (MNAR)

What it means: The missingness is related to THE MISSING VALUE ITSELF.

Example: People with low income skip the income question because they're embarrassed. The very thing you're trying to measure causes the missingness.

Bad news: This is the hardest case. You can't fully recover the information because you don't know what you don't know.

Pattern you CAN'T see:
Low income → Skip income question
High income → Answer honestly

The missingness IS the information.
Filling in the average would be WRONG.

The Arsenal: Every Way to Handle Missing Data

Now let's get practical. Here are your options, from simple to sophisticated.


Method 1: Delete the Rows (Listwise Deletion)

The approach: Any row with a missing value? Gone.

import pandas as pd

# Before: 1000 rows, some with missing values
print(f"Before: {len(df)} rows")

# After: Only complete rows
df_clean = df.dropna()
print(f"After: {len(df_clean)} rows")

Visual:

Original:
[Name] [Age] [Income] [City]
[John]  [32]  [50000]  [NYC]     ✓ Keep
[Mary]  [28]  [  ?  ]  [LA]      ✗ Delete (missing income)
[Bob]   [45]  [72000]  [  ?  ]   ✗ Delete (missing city)
[Sara]  [36]  [61000]  [Chicago] ✓ Keep

Result: Lost 50% of data!

When to Use It

✅ Data is MCAR (missing completely at random)
✅ Very few rows have missing values (< 5%)
✅ You have LOTS of data to spare
✅ Quick and dirty analysis

When to Avoid It

❌ Lots of missing data (you'll lose too much)
❌ Data is MAR or MNAR (you'll introduce bias)
❌ Missing values are informative
❌ Every row matters
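Before committing to deletion, it's worth counting what you would actually lose. A quick sketch, assuming df is your DataFrame:

# How many rows survive listwise deletion?
rows_total = len(df)
rows_complete = len(df.dropna())
print(f"Complete rows: {rows_complete} / {rows_total} ({rows_complete / rows_total:.0%})")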


Method 2: Delete the Columns

The approach: If a column has too many missing values, drop the entire column.

# Drop columns with more than 50% missing
# (thresh = the minimum number of non-NA values a column needs to be kept)
threshold = int(len(df) * 0.5)
df_clean = df.dropna(axis=1, thresh=threshold)

print(f"Columns before: {df.shape[1]}")
print(f"Columns after: {df_clean.shape[1]}")

Visual:

Original:
[Name] [Age] [Income] [FavoriteColor]
[John]  [32]  [50000]      [?]
[Mary]  [28]  [  ?  ]      [?]
[Bob]   [45]  [72000]      [?]
[Sara]  [36]  [61000]      [Blue]

FavoriteColor: 75% missing → DROP THE COLUMN

Result: 
[Name] [Age] [Income]
  ✓ No missing values in remaining columns!

When to Use It

✅ A column has massive missingness (> 50-70%)
✅ The column isn't critical to your analysis
✅ You have other columns with similar information

When to Avoid It

❌ The column is essential
❌ The missingness itself is informative
❌ You only have a few columns to begin with
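A per-column audit makes this call easier. A small sketch, assuming df is your DataFrame and a 60% cutoff (the exact threshold is a judgment call):

# Fraction of missing values per column, worst first
missing_share = df.isnull().mean()
print(missing_share.sort_values(ascending=False))

# Keep only the columns below the cutoff
cols_to_keep = missing_share[missing_share <= 0.6].index
df_clean = df[cols_to_keep]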


Method 3: Fill with a Constant (Simple Imputation)

The approach: Replace missing values with a single value — mean, median, mode, or a custom constant.

# Fill with mean
df['income'] = df['income'].fillna(df['income'].mean())

# Fill with median (better for skewed data)
df['income'] = df['income'].fillna(df['income'].median())

# Fill with mode (for categorical)
df['city'] = df['city'].fillna(df['city'].mode()[0])

# Fill with a custom value
df['income'] = df['income'].fillna(0)  # Or -1 as a flag

Visual:

Before:
Income: [50000, ?, 72000, ?, 61000, 55000]

Mean = 59,500

After (mean imputation):
Income: [50000, 59500, 72000, 59500, 61000, 55000]
               ↑              ↑
         Filled with mean

Mean vs Median vs Mode

Method | Use When
Mean   | Data is roughly symmetric
Median | Data is skewed (outliers present)
Mode   | Categorical data
import numpy as np

# Example: Salary data with a CEO outlier
salaries = [50000, 55000, 48000, 52000, 5000000]  # CEO!

print(np.mean(salaries))    # 1,041,000 - pulled up by the CEO
print(np.median(salaries))  # 52,000    - unaffected by the CEO

# Median is the better fill value here!

The Danger of Mean/Median Imputation

It underestimates variance.

Original variance:  ████████████████████
After mean impute:  ████████████
                         ↑
              You've squished the spread!

By filling with the average, you're making your data artificially "clumpy" around the center. This can break statistical tests and bias predictions.
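You can watch the shrinkage happen with a few made-up numbers. A minimal sketch:

import numpy as np
import pandas as pd

s = pd.Series([50, 60, np.nan, 80, np.nan, 100, 40])
filled = s.fillna(s.mean())

print(f"Std before imputation: {s.std():.1f}")       # NaNs are simply ignored
print(f"Std after mean fill:   {filled.std():.1f}")  # noticeably smaller spread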


Method 4: Fill Forward / Backward (Time Series)

The approach: For time-ordered data, use the previous (or next) value.

# Forward fill: use the last known value
df['stock_price'] = df['stock_price'].ffill()

# Backward fill: use the next known value
df['stock_price'] = df['stock_price'].bfill()

Visual:

Time series:
Day:   1      2      3      4      5      6
Price: 100    102    ?      ?      108    110

Forward fill (ffill):
Price: 100    102    102    102    108    110
                     ↑      ↑
              Carried forward from day 2

Backward fill (bfill):
Price: 100    102    108    108    108    110
                     ↑      ↑
              Pulled back from day 5

When to Use It

✅ Time series data
✅ Values change slowly over time
✅ The last known value is a reasonable estimate

When to Avoid It

❌ Data isn't time-ordered
❌ Values can change dramatically between observations
❌ Long gaps in the data
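One practical guard against the long-gap problem is to cap how far a value is allowed to propagate. A small sketch, assuming a stock_price column:

# Carry the last observation forward, but never across more than 2 consecutive gaps
df['stock_price'] = df['stock_price'].ffill(limit=2)

# Anything beyond the limit stays NaN, so long outages remain visible
print(df['stock_price'].isna().sum())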


Method 5: Interpolation (Time Series)

The approach: Estimate missing values based on surrounding values.

# Linear interpolation
df['temperature'] = df['temperature'].interpolate(method='linear')

# Polynomial interpolation (smoother; needs SciPy installed)
df['temperature'] = df['temperature'].interpolate(method='polynomial', order=2)

# Time-based interpolation (needs a DatetimeIndex)
df['temperature'] = df['temperature'].interpolate(method='time')

Visual:

Day:    1      2      3      4      5
Temp:   60     64     ?      ?      76

Linear interpolation:
Temp:   60     64     68     72     76
                      ↑      ↑
           Evenly spaced between 64 and 76

Types of Interpolation

Method     | Shape                   | Use When
Linear     | Straight line           | Simple trends
Polynomial | Curved                  | Smooth, curved patterns
Spline     | Very smooth curve       | Complex patterns
Time-based | Accounts for timestamps | Irregular time intervals
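One detail worth knowing: method='time' needs a datetime index, and the polynomial/spline methods need SciPy installed. A short sketch with made-up timestamps:

import numpy as np
import pandas as pd

# Time-based interpolation weights by the actual gap between timestamps,
# so it requires a DatetimeIndex
temps = pd.Series(
    [60.0, 64.0, np.nan, 76.0],
    index=pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-04', '2024-01-06'])
)
print(temps.interpolate(method='time'))  # fills the gap at Jan 4 with 70.0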

Method 6: K-Nearest Neighbors Imputation

The approach: Find similar rows (neighbors) and use their values.

from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=5)
df_imputed = pd.DataFrame(
    imputer.fit_transform(df),
    columns=df.columns
)

Visual:

Target row: [Age=30, Income=?, City=NYC, Education=Bachelor's]

Find 5 most similar rows:
1. [Age=31, Income=52000, City=NYC, Education=Bachelor's]
2. [Age=29, Income=48000, City=NYC, Education=Bachelor's]
3. [Age=32, Income=55000, City=NYC, Education=Master's]
4. [Age=28, Income=51000, City=Boston, Education=Bachelor's]
5. [Age=30, Income=49000, City=NYC, Education=Bachelor's]

Average their incomes: (52000+48000+55000+51000+49000)/5 = 51,000

Fill: [Age=30, Income=51000, City=NYC, Education=Bachelor's]

Why It's Smart

The imputed value is based on similar people, not everyone. A 30-year-old in NYC with a bachelor's degree probably earns like other 30-year-olds in NYC with bachelor's degrees — not like the overall average.

When to Use It

✅ Data is MAR (pattern depends on other variables)
✅ You have good features to define "similarity"
✅ Moderate amount of missing data

When to Avoid It

❌ High dimensionality (similarity becomes meaningless)
❌ Data is MNAR
❌ Computationally expensive for large datasets
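One caveat worth adding: KNN distances are dominated by whichever feature has the biggest numbers, so income (tens of thousands) drowns out age. A hedged sketch that scales first, imputes, then converts back, assuming numeric columns named age and income:

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.impute import KNNImputer

numeric_cols = ['age', 'income']  # assumed column names

# StandardScaler ignores NaNs when fitting and passes them through,
# so scaling before imputation is safe
scaler = StandardScaler()
scaled = scaler.fit_transform(df[numeric_cols])

imputer = KNNImputer(n_neighbors=5)
imputed_scaled = imputer.fit_transform(scaled)

# Back to the original units
df[numeric_cols] = scaler.inverse_transform(imputed_scaled)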


Method 7: Model-Based Imputation

The approach: Train a model to PREDICT the missing values.

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# Use Random Forest to predict missing values
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=10, random_state=42),
    max_iter=10,
    random_state=42
)

df_imputed = pd.DataFrame(
    imputer.fit_transform(df),
    columns=df.columns
)

How It Works:

Step 1: For each column with missing values
Step 2: Treat it as the TARGET
Step 3: Use other columns as FEATURES
Step 4: Train a model on complete rows
Step 5: Predict missing values
Step 6: Repeat until convergence

Visual:

Predicting missing income:

Features: [Age, Education, City, JobTitle]
Target:   [Income]

Train model on rows WITH income
Predict income for rows WITHOUT income

Income_missing = Model(Age, Education, City, JobTitle)

When to Use It

✅ Strong relationships between variables
✅ Moderate amount of missing data
✅ You need high accuracy

When to Avoid It

❌ Variables are independent
❌ Massive missingness
❌ Risk of data leakage in ML pipelines


Method 8: Create a "Missing" Indicator

The approach: Don't fill it. FLAG it.

# Create indicator column
df['income_missing'] = df['income'].isna().astype(int)

# Then fill the original (with mean, median, etc.)
df['income'] = df['income'].fillna(df['income'].median())

Visual:

Before:
[Name] [Income]
[John]  [50000]
[Mary]  [  ?  ]
[Bob]   [72000]

After:
[Name] [Income] [Income_Missing]
[John]  [50000]       [0]
[Mary]  [61000]       [1]    ← Filled + Flagged
[Bob]   [72000]       [0]

Why This Is Brilliant

Sometimes the fact that data is missing IS the information.

  • People who skip income might be low earners (embarrassed)
  • Missing medical tests might indicate the patient was too sick
  • Empty survey fields might indicate disengagement

By flagging, your model can LEARN from the missingness pattern.

When to Use It

✅ Suspecting MNAR (missingness is informative)
✅ Building ML models
✅ Want to preserve the "missing" signal
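If you're using scikit-learn, you don't even have to build the flags by hand: SimpleImputer can append them for you via add_indicator=True. A minimal sketch, assuming numeric age and income columns:

from sklearn.impute import SimpleImputer

# add_indicator=True appends one binary column per feature
# that had missing values during fit
imputer = SimpleImputer(strategy='median', add_indicator=True)
result = imputer.fit_transform(df[['age', 'income']])  # assumed column names

print(result.shape)  # original features plus the indicator columns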


Method 9: Multiple Imputation

The approach: Don't create ONE filled dataset. Create MANY.

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
import numpy as np

# Create multiple imputed datasets
n_imputations = 5
imputed_datasets = []

for i in range(n_imputations):
    imputer = IterativeImputer(random_state=i, sample_posterior=True)
    imputed = imputer.fit_transform(df)
    imputed_datasets.append(imputed)

# Analyze each, then combine results
results = []
for dataset in imputed_datasets:
    # Run your analysis on each
    result = your_analysis(dataset)
    results.append(result)

# Pool the point estimates (full Rubin's rules also combine the variances)
final_estimate = np.mean(results)

Visual:

Original dataset (with missing):
     [████  ?  ████  ?  ████]

Create 5 different filled versions:
     [████ 52 ████ 47 ████]  → Analysis 1 → Result 1
     [████ 48 ████ 51 ████]  → Analysis 2 → Result 2
     [████ 55 ████ 49 ████]  → Analysis 3 → Result 3
     [████ 50 ████ 53 ████]  → Analysis 4 → Result 4
     [████ 51 ████ 48 ████]  → Analysis 5 → Result 5

Combine: Average the 5 results

Why This Is the Gold Standard

Single imputation pretends you KNOW the missing value. You don't.

Multiple imputation acknowledges the UNCERTAINTY. By creating multiple plausible datasets, you capture the range of possibilities.

When to Use It

✅ Statistical research requiring valid inference
✅ Uncertainty quantification matters
✅ You need to report confidence intervals
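For completeness, here is what pooling with Rubin's rules actually looks like: combine the point estimates, then combine the within- and between-imputation variances. A rough sketch with hypothetical numbers:

import numpy as np

# Hypothetical estimates and squared standard errors from 5 imputed datasets
estimates = np.array([2.1, 1.9, 2.3, 2.0, 2.2])
variances = np.array([0.25, 0.22, 0.27, 0.24, 0.26])
m = len(estimates)

pooled_estimate = estimates.mean()
within_var = variances.mean()                       # average uncertainty inside each dataset
between_var = estimates.var(ddof=1)                 # disagreement across datasets
total_var = within_var + (1 + 1 / m) * between_var  # Rubin's total variance

print(f"Pooled estimate: {pooled_estimate:.2f}, pooled SE: {np.sqrt(total_var):.2f}")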


The Decision Flowchart

Not sure which method to use? Follow this:

START
  │
  ▼
How much data is missing?
  │
  ├─ < 5% ───────────────────────────────────────┐
  │                                              │
  ├─ 5-20% ──────────────────────────────┐       │
  │                                      │       │
  └─ > 20% ──────────────────┐           │       │
                             │           │       │
                             ▼           ▼       ▼
                     Consider if    What type    Simple
                     column is      of missing?  deletion
                     worth keeping      │        usually OK
                             │          │            │
                             ▼          │            │
                        Maybe drop      │            │
                        the column      │            │
                                        ▼            │
                              ┌─────────┴─────────┐  │
                              │                   │  │
                           MCAR?               MAR/MNAR?
                              │                   │
                              ▼                   ▼
                         Mean/Median        KNN, Model-based,
                         imputation OK      or Indicator flag

Complete Code Example

Let's handle missing data in a real scenario:

import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer

# Create sample data with missing values
np.random.seed(42)
df = pd.DataFrame({
    'age': [25, 30, np.nan, 45, 50, 35, np.nan, 40, 28, 55],
    'income': [50000, np.nan, 65000, np.nan, 90000, 55000, 70000, np.nan, 48000, 95000],
    'education': ['Bachelor', 'Master', 'Bachelor', np.nan, 'PhD', 'Bachelor', 'Master', 'Bachelor', np.nan, 'PhD'],
    'city': ['NYC', 'LA', np.nan, 'Chicago', 'NYC', 'LA', 'NYC', np.nan, 'Chicago', 'LA']
})

print("=== Original Data ===")
print(df)
print(f"\nMissing values:\n{df.isnull().sum()}")

# Strategy 1: Simple deletion
df_deleted = df.dropna()
print(f"\n=== After Deletion ===")
print(f"Rows remaining: {len(df_deleted)} / {len(df)}")

# Strategy 2: Median for numeric
df_simple = df.copy()
df_simple['age'] = df_simple['age'].fillna(df_simple['age'].median())
df_simple['income'] = df_simple['income'].fillna(df_simple['income'].median())

# Strategy 3: Mode for categorical
df_simple['education'] = df_simple['education'].fillna(df_simple['education'].mode()[0])
df_simple['city'] = df_simple['city'].fillna(df_simple['city'].mode()[0])

print(f"\n=== After Simple Imputation ===")
print(df_simple)

# Strategy 4: KNN for numeric columns
df_knn = df.copy()
numeric_cols = ['age', 'income']
knn_imputer = KNNImputer(n_neighbors=3)
df_knn[numeric_cols] = knn_imputer.fit_transform(df_knn[numeric_cols])

print(f"\n=== After KNN Imputation (numeric only) ===")
print(df_knn[numeric_cols].round(0))

# Strategy 5: Add missing indicators
df_flagged = df.copy()
for col in df.columns:
    if df[col].isnull().any():
        df_flagged[f'{col}_missing'] = df[col].isnull().astype(int)

print(f"\n=== With Missing Indicators ===")
print(df_flagged.columns.tolist())

Output:

=== Original Data ===
    age   income education     city
0  25.0  50000.0  Bachelor      NYC
1  30.0      NaN    Master       LA
2   NaN  65000.0  Bachelor      NaN
3  45.0      NaN       NaN  Chicago
4  50.0  90000.0       PhD      NYC
5  35.0  55000.0  Bachelor       LA
6   NaN  70000.0    Master      NYC
7  40.0      NaN  Bachelor      NaN
8  28.0  48000.0       NaN  Chicago
9  55.0  95000.0       PhD       LA

Missing values:
age          2
income       3
education    2
city         2

=== After Deletion ===
Rows remaining: 4 / 10

=== After Simple Imputation ===
    age   income education     city
0  25.0  50000.0  Bachelor      NYC
1  30.0  65000.0    Master       LA
2  37.5  65000.0  Bachelor       LA
3  45.0  65000.0  Bachelor  Chicago
4  50.0  90000.0       PhD      NYC
5  35.0  55000.0  Bachelor       LA
6  37.5  70000.0    Master      NYC
7  40.0  65000.0  Bachelor       LA
8  28.0  48000.0  Bachelor  Chicago
9  55.0  95000.0       PhD       LA

=== After KNN Imputation (numeric only) ===
    age    income
0  25.0   50000.0
1  30.0   51000.0
2  32.0   65000.0
3  45.0   78333.0
4  50.0   90000.0
5  35.0   55000.0
6  32.0   70000.0
7  40.0   58333.0
8  28.0   48000.0
9  55.0   95000.0

=== With Missing Indicators ===
['age', 'income', 'education', 'city', 'age_missing', 'income_missing', 'education_missing', 'city_missing']

Common Mistakes

Mistake 1: Imputing Before Train-Test Split

# WRONG: Impute on full dataset, then split
imputer.fit_transform(df)  # Uses test data statistics!
X_train, X_test = train_test_split(df)

# RIGHT: Split first, impute on train, transform test
X_train, X_test = train_test_split(df)
imputer.fit(X_train)
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)  # Uses TRAIN statistics
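An even safer pattern is to put the imputer inside a scikit-learn Pipeline, so every cross-validation fold refits it on that fold's training data automatically. A sketch, assuming X holds numeric features and y the labels:

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# The imputer is refit on the training portion of every fold,
# so test-fold statistics never leak into the fill values
pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('model', LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, X, y, cv=5)  # X, y assumed to exist
print(scores.mean())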

Mistake 2: Ignoring the Missingness Pattern

# WRONG: Just fill everything with the mean, no questions asked
df.fillna(df.mean(numeric_only=True))

# RIGHT: Investigate first
print(df.isnull().sum())
print(df[df['income'].isnull()]['education'].value_counts())  # Pattern?

Mistake 3: Using Mean for Skewed Data

# WRONG: Mean with outliers
salaries = [50000, 55000, 48000, 5000000]  # CEO!
mean_salary = np.mean(salaries)  # 1,288,250 - way too high!

# RIGHT: Median for skewed data
median_salary = np.median(salaries)  # 52,500 - reasonable

Mistake 4: Forgetting Categorical Variables

# WRONG: KNN imputer on categorical columns
knn.fit_transform(df)  # Breaks on 'city', 'education'

# RIGHT: Handle categorical separately
numeric_cols = df.select_dtypes(include=[np.number]).columns
categorical_cols = df.select_dtypes(include=['object']).columns

# Impute numeric with KNN
df[numeric_cols] = knn.fit_transform(df[numeric_cols])

# Impute categorical with mode
for col in categorical_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

The Cheat Sheet

Method              | When to Use                   | Pros                  | Cons
Delete rows         | < 5% missing, MCAR            | Simple, no bias       | Lose data
Delete columns      | > 50% missing                 | Removes noise         | Lose features
Mean/Median         | Quick analysis, MCAR          | Fast, easy            | Underestimates variance
Mode                | Categorical data              | Simple                | Ignores patterns
Forward/Back fill   | Time series                   | Preserves trends      | Can propagate errors
Interpolation       | Time series, smooth data      | Natural estimates     | Assumes continuity
KNN                 | MAR, good features            | Uses similarity       | Slow, curse of dimensionality
Model-based         | Strong variable relationships | Most accurate         | Complex, risk of leakage
Missing indicator   | MNAR, ML models               | Preserves information | Adds columns
Multiple imputation | Research, uncertainty         | Gold standard         | Complex, slow

Key Takeaways

  1. Understand WHY data is missing — MCAR, MAR, MNAR require different approaches

  2. Deletion is often the worst choice — You lose data AND might introduce bias

  3. Mean imputation is quick but dangerous — It squishes variance

  4. KNN imputation uses similar rows — Great when patterns exist

  5. Missing indicators preserve information — The absence IS data

  6. Impute AFTER train-test split — Avoid data leakage

  7. Different columns need different strategies — Numeric vs categorical

  8. Multiple imputation is the gold standard — For research and uncertainty


The One-Sentence Summary

Missing data isn't missing information — it's hidden information. Your job is to decode it, not delete it.


What's Next?

Now that you understand missing data handling, you're ready for:

  • Outlier Detection and Treatment — The other data cleaning challenge
  • Feature Encoding — Handling categorical variables
  • Data Normalization — Scaling features for models
  • Imbalanced Data — When classes aren't equal

Follow me for the next article in this series!


Let's Connect!

If this helped you handle missing data better, drop a heart!

Questions? Ask in the comments — I read and respond to every one.

What's your go-to imputation method? I'm curious!


The difference between a dataset that trains a great model and one that trains garbage? Often just how you handled the missing values. Don't delete — investigate.


Share this with someone who's been dropping rows with NaN without thinking. Save their data. Save their model.

Happy cleaning!
