The One-Line Summary: Missing data isn't a death sentence. You can delete it, fill it in, predict it, or flag it. The right choice depends on WHY it's missing and HOW MUCH is gone.
The Case of the Incomplete Census
Detective Rivera stared at the census forms spread across her desk.
The mayor needed population statistics by Monday. But something was wrong.
Half the forms were incomplete.
Some people left their income blank. Some skipped the age question. A few suspicious souls left entire pages empty.
Rivera had a decision to make.
Option 1: Throw Out the Incomplete Forms
"If they didn't fill it out completely, they don't count."
She could toss every incomplete form in the trash. Only use the perfect ones.
Problem: She'd lose half her data. The remaining sample might be biased. Maybe only wealthy people filled out the income field. Maybe only young people completed everything.
The statistics would be cleaner, but possibly wrong.
Option 2: Guess the Missing Values
"Income is blank? They probably earn the average. Age is blank? Let's say 35."
She could fill in the blanks with reasonable guesses.
Problem: Her guesses might be systematically wrong. If poor people skip the income question, filling in the average overstates their income.
The data would be complete, but possibly misleading.
Option 3: Investigate WHY It's Missing
"Why did they leave this blank? What does the absence tell us?"
She could treat the missing value as information itself. Maybe people who skip the income question ARE different — and that difference matters.
This is the smart approach.
Detective Rivera's dilemma is YOUR dilemma every time you open a dataset.
Missing data is everywhere. How you handle it determines whether your analysis is brilliant or broken.
Let me show you every tool in the detective's toolkit.
Why Is Data Missing?
Before you fix missing data, you need to understand WHY it's missing.
This isn't philosophical — it's practical. The reason determines the solution.
Type 1: Missing Completely at Random (MCAR)
What it means: The missingness has NOTHING to do with any data.
Example: A survey server crashed randomly, losing some responses. The lost responses aren't different from the kept ones — it was pure bad luck.
Good news: You can usually just delete these rows or fill them in. The remaining data is still representative.
Data lost because:
✓ Server crash
✓ Coffee spilled on paper forms
✓ Random transmission error
No pattern. Just chaos.
Type 2: Missing at Random (MAR)
What it means: The missingness is related to OTHER variables you can see, but not to the missing value itself.
Example: Young people skip the income question more often. The missingness relates to age (which you have), not to income itself.
Strategy: If you know the pattern, you can account for it. Use age to help predict the missing income.
Pattern you can SEE:
Age 18-25 → 40% skip income
Age 26-40 → 20% skip income
Age 41+   → 10% skip income
You can use age to help fill in the blanks.
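You can check for this kind of pattern directly before choosing a fix. A minimal sketch, assuming a DataFrame df with age and income columns (the bin edges here are just illustrative):
import pandas as pd
# Bucket ages, then compare the share of missing income in each bucket
age_bands = pd.cut(df['age'], bins=[17, 25, 40, 120], labels=['18-25', '26-40', '41+'])
skip_rate = df['income'].isna().groupby(age_bands).mean()
print(skip_rate)  # A clear gradient across bands points to MAR rather than MCAR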
Type 3: Missing Not at Random (MNAR)
What it means: The missingness is related to THE MISSING VALUE ITSELF.
Example: People with low income skip the income question because they're embarrassed. The very thing you're trying to measure causes the missingness.
Bad news: This is the hardest case. You can't fully recover the information because you don't know what you don't know.
Pattern you CAN'T see:
Low income → Skip income question
High income → Answer honestly
The missingness IS the information.
Filling in the average would be WRONG.
The Arsenal: Every Way to Handle Missing Data
Now let's get practical. Here are your options, from simple to sophisticated.
Method 1: Delete the Rows (Listwise Deletion)
The approach: Any row with a missing value? Gone.
import pandas as pd
# Before: 1000 rows, some with missing values
print(f"Before: {len(df)} rows")
# After: Only complete rows
df_clean = df.dropna()
print(f"After: {len(df_clean)} rows")
Visual:
Original:
[Name] [Age] [Income] [City]
[John] [32] [50000] [NYC] ✓ Keep
[Mary] [28] [ ? ] [LA] ✗ Delete (missing income)
[Bob] [45] [72000] [ ? ] ✗ Delete (missing city)
[Sara] [36] [61000] [Chicago] ✓ Keep
Result: Lost 50% of data!
When to Use It
✅ Data is MCAR (missing completely at random)
✅ Very few rows have missing values (< 5%)
✅ You have LOTS of data to spare
✅ Quick and dirty analysis
When to Avoid It
❌ Lots of missing data (you'll lose too much)
❌ Data is MAR or MNAR (you'll introduce bias)
❌ Missing values are informative
❌ Every row matters
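Before reaching for dropna, it's worth measuring how much you would actually lose. A quick check, assuming your DataFrame is called df:
# What fraction of rows would dropna() remove?
rows_with_missing = df.isna().any(axis=1).mean()
print(f"Dropping incomplete rows would discard {rows_with_missing:.1%} of the data")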
Method 2: Delete the Columns
The approach: If a column has too many missing values, drop the entire column.
# Drop columns with more than 50% missing
# (dropna's thresh is the minimum number of NON-missing values a column must have to survive)
threshold = int(len(df) * 0.5)
df_clean = df.dropna(axis=1, thresh=threshold)
print(f"Columns before: {df.shape[1]}")
print(f"Columns after: {df_clean.shape[1]}")
Visual:
Original:
[Name] [Age] [Income] [FavoriteColor]
[John] [32] [50000] [?]
[Mary] [28] [ ? ] [?]
[Bob] [45] [72000] [?]
[Sara] [36] [61000] [Blue]
FavoriteColor: 75% missing → DROP THE COLUMN
Result:
[Name] [Age] [Income]
✓ No missing values in remaining columns!
When to Use It
✅ A column has massive missingness (> 50-70%)
✅ The column isn't critical to your analysis
✅ You have other columns with similar information
When to Avoid It
❌ The column is essential
❌ The missingness itself is informative
❌ You only have a few columns to begin with
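To spot candidate columns, start with the per-column missing fraction. A small sketch, assuming df is your DataFrame and the 70% cutoff is just a placeholder:
# Fraction of missing values per column, worst offenders first
missing_share = df.isna().mean().sort_values(ascending=False)
print(missing_share)
# Columns above a chosen cutoff (70% here, purely illustrative)
print(missing_share[missing_share > 0.7].index.tolist())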
Method 3: Fill with a Constant (Simple Imputation)
The approach: Replace missing values with a single value — mean, median, mode, or a custom constant.
# Fill with the mean
df['income'] = df['income'].fillna(df['income'].mean())
# Fill with the median (better for skewed data)
df['income'] = df['income'].fillna(df['income'].median())
# Fill with the mode (for categorical columns)
df['city'] = df['city'].fillna(df['city'].mode()[0])
# Fill with a custom value
df['income'] = df['income'].fillna(0)  # Or -1 as a flag
Visual:
Before:
Income: [50000, ?, 72000, ?, 61000, 55000]
Mean = 59,500
After (mean imputation):
Income: [50000, 59500, 72000, 59500, 61000, 55000]
↑ ↑
Filled with mean
Mean vs Median vs Mode
| Method | Use When |
|---|---|
| Mean | Data is roughly symmetric |
| Median | Data is skewed (outliers present) |
| Mode | Categorical data |
# Example: salary data with a CEO outlier
import numpy as np
salaries = [50000, 55000, 48000, 52000, 5000000]  # CEO!
mean = np.mean(salaries)      # 1,041,000 - pulled up by the CEO
median = np.median(salaries)  # 52,000 - unaffected by the CEO
# The median is the better fill value here!
The Danger of Mean/Median Imputation
It underestimates variance.
Original variance: ████████████████████
After mean impute: ████████████
↑
You've squished the spread!
By filling with the average, you're making your data artificially "clumpy" around the center. This can break statistical tests and bias predictions.
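You can watch the shrinkage happen with a few made-up numbers:
import numpy as np
import pandas as pd
income = pd.Series([50000, np.nan, 72000, np.nan, 61000, 55000])
filled = income.fillna(income.mean())
print(f"Std before imputation: {income.std():.0f}")  # computed on observed values only
print(f"Std after imputation:  {filled.std():.0f}")  # smaller: the imputed points sit exactly at the mean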
Method 4: Fill Forward / Backward (Time Series)
The approach: For time-ordered data, use the previous (or next) value.
# Forward fill: carry the last known value forward
df['stock_price'] = df['stock_price'].ffill()
# Backward fill: pull the next known value back
df['stock_price'] = df['stock_price'].bfill()
Visual:
Time series:
Day: 1 2 3 4 5 6
Price: 100 102 ? ? 108 110
Forward fill (ffill):
Price: 100 102 102 102 108 110
↑ ↑
Carried forward from day 2
Backward fill (bfill):
Price: 100 102 108 108 108 110
↑ ↑
Pulled back from day 5
When to Use It
✅ Time series data
✅ Values change slowly over time
✅ The last known value is a reasonable estimate
When to Avoid It
❌ Data isn't time-ordered
❌ Values can change dramatically between observations
❌ Long gaps in the data
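For the long-gap problem, both ffill and bfill accept a limit argument so a stale value can't be carried forward forever. A small sketch, assuming the same stock_price column:
# Carry the last known price forward, but across at most 2 consecutive gaps
df['stock_price'] = df['stock_price'].ffill(limit=2)
# Anything still missing after that is a long gap and deserves a deliberate choice
print(df['stock_price'].isna().sum(), "values left unfilled")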
Method 5: Interpolation (Time Series)
The approach: Estimate missing values based on surrounding values.
# Linear interpolation
df['temperature'] = df['temperature'].interpolate(method='linear')
# Polynomial interpolation (smoother; requires SciPy)
df['temperature'] = df['temperature'].interpolate(method='polynomial', order=2)
# Time-based interpolation (requires a DatetimeIndex)
df['temperature'] = df['temperature'].interpolate(method='time')
Visual:
Day: 1 2 3 4 5
Temp: 60 64 ? ? 76
Linear interpolation:
Temp: 60 64 68 72 76
↑ ↑
Evenly spaced between 64 and 76
Types of Interpolation
| Method | Shape | Use When |
|---|---|---|
| Linear | Straight line | Simple trends |
| Polynomial | Curved | Smooth, curved patterns |
| Spline | Very smooth curve | Complex patterns |
| Time-based | Accounts for timestamps | Irregular time intervals |
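One practical note: method='time' only works when your data has a DatetimeIndex. A minimal sketch with toy temperatures and an irregular gap:
import numpy as np
import pandas as pd
temps = pd.DataFrame(
    {'temperature': [60.0, 64.0, np.nan, np.nan, 76.0]},
    index=pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-05', '2024-01-06'])
)
# 'time' weights the fill by how far apart the timestamps actually are
temps['temperature'] = temps['temperature'].interpolate(method='time')
print(temps)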
Method 6: K-Nearest Neighbors Imputation
The approach: Find similar rows (neighbors) and use their values.
from sklearn.impute import KNNImputer

# Note: KNNImputer expects all-numeric input (categoricals are handled in Mistake 4 below)
imputer = KNNImputer(n_neighbors=5)
df_imputed = pd.DataFrame(
    imputer.fit_transform(df),
    columns=df.columns
)
Visual:
Target row: [Age=30, Income=?, City=NYC, Education=Bachelor's]
Find 5 most similar rows:
1. [Age=31, Income=52000, City=NYC, Education=Bachelor's]
2. [Age=29, Income=48000, City=NYC, Education=Bachelor's]
3. [Age=32, Income=55000, City=NYC, Education=Master's]
4. [Age=28, Income=51000, City=Boston, Education=Bachelor's]
5. [Age=30, Income=49000, City=NYC, Education=Bachelor's]
Average their incomes: (52000+48000+55000+51000+49000)/5 = 51,000
Fill: [Age=30, Income=51000, City=NYC, Education=Bachelor's]
Why It's Smart
The imputed value is based on similar people, not everyone. A 30-year-old in NYC with a bachelor's degree probably earns like other 30-year-olds in NYC with bachelor's degrees — not like the overall average.
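One caveat: the distance calculation is dominated by large-scale features (income in dollars swamps age in years), so it usually pays to standardize before imputing. A sketch, assuming df holds only numeric columns:
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled = scaler.fit_transform(df)  # NaNs pass through the scaler untouched
imputer = KNNImputer(n_neighbors=5)
imputed_scaled = imputer.fit_transform(scaled)
# Back to the original units
df_imputed = pd.DataFrame(scaler.inverse_transform(imputed_scaled), columns=df.columns)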
When to Use It
✅ Data is MAR (pattern depends on other variables)
✅ You have good features to define "similarity"
✅ Moderate amount of missing data
When to Avoid It
❌ High dimensionality (similarity becomes meaningless)
❌ Data is MNAR
❌ Computationally expensive for large datasets
Method 7: Model-Based Imputation
The approach: Train a model to PREDICT the missing values.
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor
# Use Random Forest to predict missing values
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=10, random_state=42),
    max_iter=10,
    random_state=42
)
df_imputed = pd.DataFrame(
    imputer.fit_transform(df),
    columns=df.columns
)
How It Works:
Step 1: For each column with missing values
Step 2: Treat it as the TARGET
Step 3: Use other columns as FEATURES
Step 4: Train a model on complete rows
Step 5: Predict missing values
Step 6: Repeat until convergence
Visual:
Predicting missing income:
Features: [Age, Education, City, JobTitle]
Target: [Income]
Train model on rows WITH income
Predict income for rows WITHOUT income
Income_missing = Model(Age, Education, City, JobTitle)
When to Use It
✅ Strong relationships between variables
✅ Moderate amount of missing data
✅ You need high accuracy
When to Avoid It
❌ Variables are independent
❌ Massive missingness
❌ Risk of data leakage in ML pipelines
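The leakage risk is easiest to contain by putting the imputer inside a scikit-learn Pipeline, so it gets re-fit on the training fold only. A sketch, assuming you already have features X and target y:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

pipe = Pipeline([
    ('impute', IterativeImputer(max_iter=10, random_state=42)),
    ('model', RandomForestRegressor(n_estimators=100, random_state=42)),
])
# The imputer is fit on each training fold only, never on the held-out fold
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())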
Method 8: Create a "Missing" Indicator
The approach: Don't fill it. FLAG it.
# Create an indicator column
df['income_missing'] = df['income'].isna().astype(int)
# Then fill the original column (with mean, median, etc.)
df['income'] = df['income'].fillna(df['income'].median())
Visual:
Before:
[Name] [Income]
[John] [50000]
[Mary] [ ? ]
[Bob] [72000]
After:
[Name] [Income] [Income_Missing]
[John] [50000] [0]
[Mary] [61000] [1] ← Filled + Flagged
[Bob] [72000] [0]
Why This Is Brilliant
Sometimes the fact that data is missing IS the information.
- People who skip income might be low earners (embarrassed)
- Missing medical tests might indicate the patient was too sick
- Empty survey fields might indicate disengagement
By flagging, your model can LEARN from the missingness pattern.
When to Use It
✅ Suspecting MNAR (missingness is informative)
✅ Building ML models
✅ Want to preserve the "missing" signal
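If you're in scikit-learn land, SimpleImputer can do the fill and the flag in one step with add_indicator=True. A small sketch on the numeric columns from the running example:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='median', add_indicator=True)
filled_plus_flags = imputer.fit_transform(df[['age', 'income']])
# Imputed columns come first, then one indicator column per feature that had missing values
print(filled_plus_flags.shape)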
Method 9: Multiple Imputation
The approach: Don't create ONE filled dataset. Create MANY.
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
import numpy as np
# Create multiple imputed datasets
n_imputations = 5
imputed_datasets = []
for i in range(n_imputations):
    imputer = IterativeImputer(random_state=i, sample_posterior=True)
    imputed = imputer.fit_transform(df)
    imputed_datasets.append(imputed)

# Analyze each dataset, then combine the results
results = []
for dataset in imputed_datasets:
    # Run your analysis on each imputed copy (your_analysis is a placeholder)
    result = your_analysis(dataset)
    results.append(result)

# Pool the point estimates (full Rubin's rules also combine the variances)
final_estimate = np.mean(results)
Visual:
Original dataset (with missing):
[████ ? ████ ? ████]
Create 5 different filled versions:
[████ 52 ████ 47 ████] → Analysis 1 → Result 1
[████ 48 ████ 51 ████] → Analysis 2 → Result 2
[████ 55 ████ 49 ████] → Analysis 3 → Result 3
[████ 50 ████ 53 ████] → Analysis 4 → Result 4
[████ 51 ████ 48 ████] → Analysis 5 → Result 5
Combine: Average the 5 results
Why This Is the Gold Standard
Single imputation pretends you KNOW the missing value. You don't.
Multiple imputation acknowledges the UNCERTAINTY. By creating multiple plausible datasets, you capture the range of possibilities.
When to Use It
✅ Statistical research requiring valid inference
✅ Uncertainty quantification matters
✅ You need to report confidence intervals
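For completeness, Rubin's rules pool more than the point estimate: the total variance combines the average within-imputation variance with the spread between imputations. A rough sketch, assuming each analysis returns an estimate and its variance:
import numpy as np

def pool_rubin(estimates, variances):
    """Pool results from m imputed datasets: point estimate plus total variance."""
    m = len(estimates)
    q_bar = np.mean(estimates)        # pooled point estimate
    w_bar = np.mean(variances)        # average within-imputation variance
    b = np.var(estimates, ddof=1)     # between-imputation variance
    total_var = w_bar + (1 + 1 / m) * b
    return q_bar, total_var

# estimates/variances would come from running the same analysis on each imputed dataset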
The Decision Flowchart
Not sure which method to use? Follow this:
START: How much of the column is missing?
│
├─ Less than 5%
│     → Simple deletion is usually OK.
│
├─ 5-20%
│     → What type of missingness?
│           ├─ MCAR → Mean/median imputation is OK.
│           └─ MAR or MNAR → Use KNN, model-based imputation, or an indicator flag.
│
└─ More than 20%
      → Consider whether the column is worth keeping; maybe drop it.
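If you'd rather have these rules of thumb in code, here's a rough helper (suggest_strategy is a made-up name, not a library function; the thresholds simply mirror the flowchart):
def suggest_strategy(df, column, assume_mcar=False):
    """Rule-of-thumb suggestion for one column, mirroring the flowchart above."""
    missing_share = df[column].isna().mean()
    if missing_share == 0:
        return "no missing values: nothing to do"
    if missing_share < 0.05:
        return "simple deletion (or simple imputation) is usually fine"
    if missing_share > 0.20:
        return "consider whether the column is worth keeping; maybe drop it"
    if assume_mcar:
        return "mean/median/mode imputation is acceptable"
    return "reach for KNN or model-based imputation, or add a missing indicator"

# Example: suggest_strategy(df, 'income')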
Complete Code Example
Let's handle missing data in a real scenario:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
# Create sample data with missing values
np.random.seed(42)
df = pd.DataFrame({
    'age': [25, 30, np.nan, 45, 50, 35, np.nan, 40, 28, 55],
    'income': [50000, np.nan, 65000, np.nan, 90000, 55000, 70000, np.nan, 48000, 95000],
    'education': ['Bachelor', 'Master', 'Bachelor', np.nan, 'PhD', 'Bachelor', 'Master', 'Bachelor', np.nan, 'PhD'],
    'city': ['NYC', 'LA', np.nan, 'Chicago', 'NYC', 'LA', 'NYC', np.nan, 'Chicago', 'LA']
})
print("=== Original Data ===")
print(df)
print(f"\nMissing values:\n{df.isnull().sum()}")
# Strategy 1: Simple deletion
df_deleted = df.dropna()
print(f"\n=== After Deletion ===")
print(f"Rows remaining: {len(df_deleted)} / {len(df)}")
# Strategy 2: Mean/Median for numeric
df_simple = df.copy()
df_simple['age'] = df_simple['age'].fillna(df_simple['age'].median())
df_simple['income'] = df_simple['income'].fillna(df_simple['income'].median())
# Strategy 3: Mode for categorical
df_simple['education'] = df_simple['education'].fillna(df_simple['education'].mode()[0])
df_simple['city'] = df_simple['city'].fillna(df_simple['city'].mode()[0])
print(f"\n=== After Simple Imputation ===")
print(df_simple)
# Strategy 4: KNN for numeric columns
df_knn = df.copy()
numeric_cols = ['age', 'income']
knn_imputer = KNNImputer(n_neighbors=3)
df_knn[numeric_cols] = knn_imputer.fit_transform(df_knn[numeric_cols])
print(f"\n=== After KNN Imputation (numeric only) ===")
print(df_knn[numeric_cols].round(0))
# Strategy 5: Add missing indicators
df_flagged = df.copy()
for col in df.columns:
    if df[col].isnull().any():
        df_flagged[f'{col}_missing'] = df[col].isnull().astype(int)
print(f"\n=== With Missing Indicators ===")
print(df_flagged.columns.tolist())
Output:
=== Original Data ===
age income education city
0 25.0 50000.0 Bachelor NYC
1 30.0 NaN Master LA
2 NaN 65000.0 Bachelor NaN
3 45.0 NaN NaN Chicago
4 50.0 90000.0 PhD NYC
5 35.0 55000.0 Bachelor LA
6 NaN 70000.0 Master NYC
7 40.0 NaN Bachelor NaN
8 28.0 48000.0 NaN Chicago
9 55.0 95000.0 PhD LA
Missing values:
age 2
income 3
education 2
city 2
=== After Deletion ===
Rows remaining: 4 / 10
=== After Simple Imputation ===
age income education city
0 25.0 50000.0 Bachelor NYC
1 30.0 65000.0 Master LA
2 37.5 65000.0 Bachelor LA
3 45.0 65000.0 Bachelor Chicago
4 50.0 90000.0 PhD NYC
5 35.0 55000.0 Bachelor LA
6 37.5 70000.0 Master NYC
7 40.0 65000.0 Bachelor LA
8 28.0 48000.0 Bachelor Chicago
9 55.0 95000.0 PhD LA
=== After KNN Imputation (numeric only) ===
age income
0 25.0 50000.0
1 30.0 51000.0
2 32.0 65000.0
3 45.0 78333.0
4 50.0 90000.0
5 35.0 55000.0
6 32.0 70000.0
7 40.0 58333.0
8 28.0 48000.0
9 55.0 95000.0
=== With Missing Indicators ===
['age', 'income', 'education', 'city', 'age_missing', 'income_missing', 'education_missing', 'city_missing']
Common Mistakes
Mistake 1: Imputing Before Train-Test Split
from sklearn.model_selection import train_test_split

# WRONG: impute on the full dataset, then split
imputer.fit_transform(df)  # Uses test-set statistics!
X_train, X_test = train_test_split(df)

# RIGHT: split first, fit the imputer on train, transform both
X_train, X_test = train_test_split(df)
imputer.fit(X_train)
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)  # Uses TRAIN statistics
Mistake 2: Ignoring the Missingness Pattern
# WRONG: blindly fill everything with the mean
df.fillna(df.mean(numeric_only=True))

# RIGHT: investigate the pattern first
print(df.isnull().sum())
print(df[df['income'].isnull()]['education'].value_counts())  # Is the missingness related to education?
Mistake 3: Using Mean for Skewed Data
# WRONG: Mean with outliers
salaries = [50000, 55000, 48000, 5000000] # CEO!
mean_salary = np.mean(salaries) # 1,288,250 - way too high!
# RIGHT: Median for skewed data
median_salary = np.median(salaries) # 52,500 - reasonable
Mistake 4: Forgetting Categorical Variables
# WRONG: KNN imputer on the raw DataFrame
knn = KNNImputer(n_neighbors=5)
knn.fit_transform(df)  # Breaks on 'city', 'education' (non-numeric)

# RIGHT: handle categorical columns separately
numeric_cols = df.select_dtypes(include=[np.number]).columns
categorical_cols = df.select_dtypes(include=['object']).columns

# Impute numeric columns with KNN
df[numeric_cols] = knn.fit_transform(df[numeric_cols])

# Impute categorical columns with the mode
for col in categorical_cols:
    df[col] = df[col].fillna(df[col].mode()[0])
The Cheat Sheet
| Method | When to Use | Pros | Cons |
|---|---|---|---|
| Delete rows | < 5% missing, MCAR | Simple, no bias | Lose data |
| Delete columns | > 50% missing | Removes noise | Lose features |
| Mean/Median | Quick analysis, MCAR | Fast, easy | Underestimates variance |
| Mode | Categorical data | Simple | Ignores patterns |
| Forward/Back fill | Time series | Preserves trends | Can propagate errors |
| Interpolation | Time series, smooth data | Natural estimates | Assumes continuity |
| KNN | MAR, good features | Uses similarity | Slow, curse of dimensionality |
| Model-based | Strong variable relationships | Most accurate | Complex, risk of leakage |
| Missing indicator | MNAR, ML models | Preserves information | Adds columns |
| Multiple imputation | Research, uncertainty | Gold standard | Complex, slow |
Key Takeaways
Understand WHY data is missing — MCAR, MAR, MNAR require different approaches
Deletion is often the worst choice — You lose data AND might introduce bias
Mean imputation is quick but dangerous — It squishes variance
KNN imputation uses similar rows — Great when patterns exist
Missing indicators preserve information — The absence IS data
Impute AFTER train-test split — Avoid data leakage
Different columns need different strategies — Numeric vs categorical
Multiple imputation is the gold standard — For research and uncertainty
The One-Sentence Summary
Missing data isn't missing information; it's hidden information, and your job is to decode it, not delete it.
What's Next?
Now that you understand missing data handling, you're ready for:
- Outlier Detection and Treatment — The other data cleaning challenge
- Feature Encoding — Handling categorical variables
- Data Normalization — Scaling features for models
- Imbalanced Data — When classes aren't equal
Follow me for the next article in this series!
Let's Connect!
If this helped you handle missing data better, drop a heart!
Questions? Ask in the comments — I read and respond to every one.
What's your go-to imputation method? I'm curious!
The difference between a dataset that trains a great model and one that trains garbage? Often just how you handled the missing values. Don't delete — investigate.
Share this with someone who's been dropping rows with NaN without thinking. Save their data. Save their model.
Happy cleaning!