The One-Line Summary: Outliers are data points that don't fit the pattern. They're either precious insights, dangerous errors, or rare but real phenomena. Your job is to figure out which — and handle each accordingly.
The Zookeeper's Database Disaster
You're the new data analyst at the Metropolitan Zoo.
Your first task: Verify the animal weight database.
You pull up the penguin records:
Penguin ID    Weight (kg)
──────────────────────────
PEN001             8.2
PEN002             7.5
PEN003             9.1
PEN004         3,247.0   ← 🤔
PEN005             7.8
PEN006             8.4
PEN007             0.003 ← 🤔
PEN008             8.0
You stare at PEN004: 3,247 kg.
That's not a penguin. That's a small car. Emperor penguins max out at around 45 kg.
You stare at PEN007: 0.003 kg.
That's 3 grams. A penguin EGG weighs more than that.
The Four Possibilities
For each outlier, exactly one of these is true:
Possibility 1: Data Entry Error 📝
Someone typed 3247 instead of 32.47. Or 0.003 instead of 8.003.
Action: Fix it if you can find the true value. Remove it if you can't.
Possibility 2: Measurement Error 📏
The scale malfunctioned. Or someone weighed the penguin while it was holding a fish. Or standing on another penguin.
Action: Remove or re-measure.
Possibility 3: Wrong Category 🏷️
PEN004 isn't a penguin at all — someone tagged an elephant with a penguin ID. PEN007 might be a penguin feather sample, not a whole penguin.
Action: Investigate and recategorize.
Possibility 4: Real But Rare 🦖
Maybe, just maybe, this is a legitimate record. A mutant penguin. An undiscovered species. A miracle of nature.
Action: Keep it! This might be the most valuable data point you have.
This is the outlier dilemma.
You can't just blindly delete outliers. You can't blindly keep them either. You need to INVESTIGATE, UNDERSTAND, and then DECIDE.
Let me show you how.
What Exactly Is an Outlier?
An outlier is a data point that differs significantly from other observations.
Normal distribution with outliers:

                                   ┌─ Outlier (too high)
                                   │
                                   ▼
    │                              ●
    │
    │           ╭────╮
    │         ╭─╯    ╰─╮
    │       ╭─╯        ╰─╮
    │     ╭─╯            ╰─╮
    │   ╭─╯                ╰─╮
    │ ╭─╯                    ╰─╮
────┴─╯────────────────────────╰───●────
                                   ▲
                                   │
                            Outlier (too low)
But "significantly different" is subjective. Let's make it concrete.
Detection Method 1: The Z-Score
The idea: How many standard deviations away from the mean?
Z-score = (X - mean) / std
If |Z| > 3, it's an outlier (common threshold)
Interpretation:
- Z = 0 → Exactly average
- Z = 1 → One standard deviation above average
- Z = 3 → Three standard deviations above (very rare!)
- Z = -2 → Two standard deviations below
import numpy as np
from scipy import stats
# Penguin weights
weights = np.array([8.2, 7.5, 9.1, 3247.0, 7.8, 8.4, 0.003, 8.0])
# Calculate Z-scores
z_scores = stats.zscore(weights)
# Find outliers (|Z| > 3)
outliers = np.abs(z_scores) > 3
print("Weight Z-Score Outlier?")
print("-" * 35)
for w, z, is_out in zip(weights, z_scores, outliers):
    print(f"{w:>8.3f} {z:>7.2f} {'YES 🚨' if is_out else 'No'}")
Output:
Weight Z-Score Outlier?
-----------------------------------
8.200 -0.38 No
7.500 -0.38 No
9.100 -0.38 No
3247.000 2.65 No ← Wait, what?!
7.800 -0.38 No
8.400 -0.38 No
0.003 -0.38 No ← This too?!
8.000 -0.38 No
Wait, why didn't it catch the obvious outliers?
Because Z-score uses mean and standard deviation, which are themselves DESTROYED by outliers!
The 3,247 kg penguin pulled the mean up to ~412 kg and inflated the std to ~1,070 kg. Now nothing looks unusual relative to this corrupted baseline. Worse, with only 8 data points no value can ever score above √7 ≈ 2.65, so the |Z| > 3 rule is guaranteed to flag nothing here.
Z-score is sensitive to the very outliers it's trying to detect!
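A small-sample caveat that is easy to check in code: with the population standard deviation (the `stats.zscore` default, `ddof=0`), no point in a sample of n values can have |Z| above √(n - 1). The sketch below is only an illustration; the `fake` array is a made-up worst case, not zoo data.

```python
import numpy as np
from scipy import stats

n = 8
# Ceiling on |Z| for any single point in a sample of n values (ddof=0)
max_possible_z = np.sqrt(n - 1)

# Made-up worst case: seven identical values and one absurdly large one
fake = np.array([8.0] * 7 + [1e9])
largest_z = np.abs(stats.zscore(fake)).max()

print(f"Max possible |Z| for n={n}: {max_possible_z:.4f}")
print(f"Largest |Z| in the worst case: {largest_z:.4f}")
```

However extreme the eighth value gets, its Z-score stays below 3, which is why the fixed threshold is a poor fit for tiny samples.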
Detection Method 2: The IQR Method (Robust!)
The idea: Use median and quartiles instead of mean and std. These are ROBUST to outliers.
IQR = Q3 - Q1 (Interquartile Range)
Lower bound = Q1 - 1.5 × IQR
Upper bound = Q3 + 1.5 × IQR
Anything outside these bounds is an outlier.
Visual:

                 Q1       Median        Q3
                 │           │          │
       ──────────┼───────────┼──────────┼──────────
                 │◀──────── IQR ───────▶│
                 │                      │
          ◀──────┼──────────────────────┼──────▶
          1.5×IQR                        1.5×IQR
          │                                    │
     Lower Bound                          Upper Bound
          │                                    │
   ●──────┼────────────────────────────────────┼──────●
Outlier   │            Normal Range            │   Outlier
import numpy as np
weights = np.array([8.2, 7.5, 9.1, 3247.0, 7.8, 8.4, 0.003, 8.0])
# Calculate IQR
Q1 = np.percentile(weights, 25)
Q3 = np.percentile(weights, 75)
IQR = Q3 - Q1
# Calculate bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
print(f"Q1: {Q1:.3f}, Q3: {Q3:.3f}, IQR: {IQR:.3f}")
print(f"Lower bound: {lower_bound:.3f}")
print(f"Upper bound: {upper_bound:.3f}")
print()
# Find outliers
print("Weight Outlier?")
print("-" * 25)
for w in weights:
    is_outlier = w < lower_bound or w > upper_bound
    print(f"{w:>10.3f} {'YES 🚨' if is_outlier else 'No'}")
Output:
Q1: 7.725, Q3: 8.575, IQR: 0.850
Lower bound: 6.450
Upper bound: 9.850
Weight Outlier?
-------------------------
8.200 No
7.500 No
9.100 No
3247.000 YES 🚨
7.800 No
8.400 No
0.003 YES 🚨
8.000 No
Now it works! The IQR method correctly identified both suspicious penguins.
Detection Method 3: Modified Z-Score (Best of Both)
The idea: Z-score concept, but using median and MAD (Median Absolute Deviation) instead of mean and std.
MAD = median(|X - median(X)|)
Modified Z = 0.6745 × (X - median) / MAD
If |Modified Z| > 3.5, it's an outlier
import numpy as np
def modified_z_score(data):
    median = np.median(data)
    mad = np.median(np.abs(data - median))
    modified_z = 0.6745 * (data - median) / mad
    return modified_z
weights = np.array([8.2, 7.5, 9.1, 3247.0, 7.8, 8.4, 0.003, 8.0])
mod_z = modified_z_score(weights)
outliers = np.abs(mod_z) > 3.5
print("Weight Mod Z-Score Outlier?")
print("-" * 40)
for w, z, is_out in zip(weights, mod_z, outliers):
    print(f"{w:>10.3f} {z:>10.2f} {'YES 🚨' if is_out else 'No'}")
Output:
Weight Mod Z-Score Outlier?
----------------------------------------
8.200 0.15 No
7.500 -0.90 No
9.100 1.50 No
3247.000 4854.75 YES 🚨
7.800 -0.45 No
8.400 0.45 No
0.003 -12.14 YES 🚨
8.000 -0.15 No
Here the median is 8.1 and the MAD is 0.45, so the 3,247 kg "penguin" has a modified Z-score of about 4,855. Yeah, that's not a penguin.
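If you would rather not hand-roll the MAD, SciPy ships `scipy.stats.median_abs_deviation`. The sketch below is one possible variant, not the only way to do it: passing `scale="normal"` divides the raw MAD by roughly 0.6745, which plays the same role as the constant in the formula above. The zero-MAD guard is my own addition for the case where the denominator vanishes.

```python
import numpy as np
from scipy import stats

weights = np.array([8.2, 7.5, 9.1, 3247.0, 7.8, 8.4, 0.003, 8.0])

# scale="normal" divides the raw MAD by ~0.6745, so no manual constant is needed
mad_scaled = stats.median_abs_deviation(weights, scale="normal")

if mad_scaled == 0:
    # Happens, for example, when more than half the values are identical
    raise ValueError("MAD is zero; fall back to another method such as IQR")

robust_z = (weights - np.median(weights)) / mad_scaled
flagged = weights[np.abs(robust_z) > 3.5]
print(flagged)
```

The same two records are flagged as with the manual formula.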
Detection Method 4: Isolation Forest (ML-Based)
The idea: Outliers are easier to "isolate" with random splits. Train a forest to find them.
from sklearn.ensemble import IsolationForest
import numpy as np
# Reshape for sklearn
weights = np.array([8.2, 7.5, 9.1, 3247.0, 7.8, 8.4, 0.003, 8.0]).reshape(-1, 1)
# Fit Isolation Forest
iso_forest = IsolationForest(contamination=0.2, random_state=42)
predictions = iso_forest.fit_predict(weights)
# -1 = outlier, 1 = normal
print("Weight Prediction")
print("-" * 25)
for w, pred in zip(weights.flatten(), predictions):
    status = "OUTLIER 🚨" if pred == -1 else "Normal"
    print(f"{w:>10.3f} {status}")
Output:
Weight Prediction
-------------------------
8.200 Normal
7.500 Normal
9.100 Normal
3247.000 OUTLIER 🚨
7.800 Normal
8.400 Normal
0.003 OUTLIER 🚨
8.000 Normal
When to use: High-dimensional data where simple statistics don't work well.
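To make the multi-feature case a bit more concrete, here is a minimal sketch with two features. Everything in it is assumed for illustration: the penguin measurements are randomly generated, and the `suspect` row is invented. It uses `score_samples`, where lower scores mean "more anomalous", instead of relying on a `contamination` guess.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Hypothetical penguins: (weight kg, height cm), tightly clustered
penguins = np.column_stack([
    rng.normal(8.0, 0.5, size=200),
    rng.normal(47.0, 2.0, size=200),
])
# One made-up record that is extreme in both features
suspect = np.array([[60.0, 150.0]])
X = np.vstack([penguins, suspect])

forest = IsolationForest(random_state=42).fit(X)

# score_samples: the lower the score, the more anomalous the point
scores = forest.score_samples(X)
most_anomalous = int(np.argmin(scores))
print(f"Most anomalous row: {most_anomalous}, values: {X[most_anomalous]}")
```

Ranking by score lets you review the strangest rows first without committing to a fixed outlier fraction up front.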
Detection Method 5: DBSCAN (Density-Based)
The idea: Outliers are points in low-density regions.
from sklearn.cluster import DBSCAN
import numpy as np
weights = np.array([8.2, 7.5, 9.1, 3247.0, 7.8, 8.4, 0.003, 8.0]).reshape(-1, 1)
# DBSCAN clustering
dbscan = DBSCAN(eps=1.0, min_samples=2)
labels = dbscan.fit_predict(weights)
# -1 = noise (outlier)
print("Weight Cluster")
print("-" * 25)
for w, label in zip(weights.flatten(), labels):
    status = "OUTLIER 🚨" if label == -1 else f"Cluster {label}"
    print(f"{w:>10.3f} {status}")
Output:
Weight Cluster
-------------------------
8.200 Cluster 0
7.500 Cluster 0
9.100 Cluster 0
3247.000 OUTLIER 🚨
7.800 Cluster 0
8.400 Cluster 0
0.003 OUTLIER 🚨
8.000 Cluster 0
Visual Detection: Box Plots and Scatter Plots
Sometimes your eyes are the best detector.
import matplotlib.pyplot as plt
import numpy as np
weights = np.array([8.2, 7.5, 9.1, 3247.0, 7.8, 8.4, 0.003, 8.0])
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Box Plot
axes[0].boxplot(weights)  # vertical orientation is the default
axes[0].set_title('Box Plot - Outliers Visible!', fontsize=14)
axes[0].set_ylabel('Weight (kg)')
# Scatter Plot
axes[1].scatter(range(len(weights)), weights, s=100, c='blue', alpha=0.7)
axes[1].axhline(y=np.median(weights), color='red', linestyle='--', label='Median')
axes[1].set_title('Scatter Plot - Spot the Anomalies!', fontsize=14)
axes[1].set_xlabel('Penguin ID')
axes[1].set_ylabel('Weight (kg)')
axes[1].legend()
plt.tight_layout()
plt.savefig('outlier_visualization.png', dpi=150)
plt.show()
Visual intuition is powerful. A box plot instantly reveals outliers as points beyond the whiskers.
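If you want the numbers behind the picture, Matplotlib exposes the same calculation it uses to draw the box. A small sketch, assuming the default whisker length of 1.5 × IQR:

```python
import numpy as np
from matplotlib import cbook

weights = np.array([8.2, 7.5, 9.1, 3247.0, 7.8, 8.4, 0.003, 8.0])

# boxplot_stats returns one dict per dataset; whis=1.5 is the default
box = cbook.boxplot_stats(weights, whis=1.5)[0]

print(f"Whiskers: {box['whislo']} to {box['whishi']}")
print(f"Fliers (drawn as individual points): {box['fliers']}")
```

The whiskers stop at the most extreme values that are still inside the fences, and anything beyond them is reported as a flier.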
Now What? Handling the Outliers
You've found them. Now what do you do with them?
Option 1: Remove Them
When: You're confident they're errors.
import pandas as pd

# Example frame so the snippet runs on its own
df = pd.DataFrame({'weight': [8.2, 7.5, 9.1, 3247.0, 7.8, 8.4, 0.003, 8.0]})

def remove_outliers_iqr(data, column):
    """Keep only the rows whose value lies inside the IQR fences."""
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    return data[(data[column] >= lower) & (data[column] <= upper)]

# Remove penguin weight outliers
df_clean = remove_outliers_iqr(df, 'weight')
print(f"Before: {len(df)} rows")
print(f"After: {len(df_clean)} rows")
⚠️ Warning: You're losing data! Make sure they're truly errors.
Option 2: Cap/Winsorize Them
When: You want to keep the data point but limit its influence.
Winsorizing: Replace outliers with the nearest "normal" value.
from scipy.stats import mstats
import numpy as np
weights = np.array([8.2, 7.5, 9.1, 3247.0, 7.8, 8.4, 0.003, 8.0])
# Winsorize the lowest and highest 12.5% (one value at each end of these 8).
# Note: limits=[0.05, 0.05] would change nothing here, because 5% of 8 values rounds down to 0.
winsorized = mstats.winsorize(weights, limits=[0.125, 0.125])
print("Original Winsorized")
print("-" * 25)
for orig, wins in zip(weights, winsorized):
    print(f"{orig:>10.3f} {wins:>10.3f}")
Output:
Original Winsorized
-------------------------
8.200 8.200
7.500 7.500
9.100 9.100
3247.000 9.100 ← Capped to the next-highest value!
7.800 7.800
8.400 8.400
0.003 7.500 ← Raised to the next-lowest value!
8.000 8.000
The 3,247 kg penguin becomes 9.1 kg (the maximum "normal" penguin).
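Capping doesn't have to go through `mstats.winsorize`. As a hedged alternative, you can clip to the IQR fences with `np.clip`. This is a different rule, chosen here just for illustration: values are pulled to the fence itself rather than to the nearest observed value.

```python
import numpy as np

weights = np.array([8.2, 7.5, 9.1, 3247.0, 7.8, 8.4, 0.003, 8.0])

# Illustrative choice: cap at the IQR fences rather than at observed values
q1, q3 = np.percentile(weights, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values below `lower` become `lower`; values above `upper` become `upper`
capped = np.clip(weights, lower, upper)
print(capped)
```

With this rule the two suspicious weights end up at the fences (about 6.45 and 9.85 kg) and every other value is left untouched.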
Option 3: Transform the Data
When: Outliers exist because of skewed distributions.
import numpy as np
# Original skewed data (incomes with a billionaire)
incomes = np.array([50000, 55000, 48000, 62000, 51000, 5000000000]) # $5 billion!
# Log transform compresses the scale
log_incomes = np.log1p(incomes) # log(1 + x) handles zeros
print("Original Income Log Transformed")
print("-" * 40)
for orig, log_val in zip(incomes, log_incomes):
    print(f"${orig:>15,} {log_val:>10.2f}")
Output:
Original Income Log Transformed
----------------------------------------
$ 50,000 10.82
$ 55,000 10.92
$ 48,000 10.78
$ 62,000 11.03
$ 51,000 10.84
$ 5,000,000,000 22.33
The billionaire is still the highest, but the gap is now manageable (22 vs 11 instead of 5,000,000,000 vs 50,000).
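One practical point about transforms: they're reversible, so you can model on the log scale and report in dollars. A short sketch, reusing the same made-up incomes, that compares the spread before and after and then undoes the transform with `np.expm1` (the inverse of `np.log1p`):

```python
import numpy as np

incomes = np.array([50000, 55000, 48000, 62000, 51000, 5000000000], dtype=float)

log_incomes = np.log1p(incomes)

# Ratio of largest to smallest value, before and after the transform
raw_ratio = incomes.max() / incomes.min()
log_ratio = log_incomes.max() / log_incomes.min()
print(f"Raw ratio: {raw_ratio:,.0f}")
print(f"Log ratio: {log_ratio:.2f}")

# np.expm1 undoes np.log1p, so the original values can be recovered
recovered = np.expm1(log_incomes)
print(np.allclose(recovered, incomes))
```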
Option 4: Impute Them
When: You believe the outlier is an error but want to keep the row.
import numpy as np
weights = np.array([8.2, 7.5, 9.1, 3247.0, 7.8, 8.4, 0.003, 8.0])
# Identify outliers using IQR
Q1, Q3 = np.percentile(weights, [25, 75])
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
# Replace outliers with median
median = np.median(weights)
weights_imputed = np.where(
    (weights < lower) | (weights > upper),
    median,   # Replace with median
    weights   # Keep original
)
print("Original Imputed")
print("-" * 25)
for orig, imp in zip(weights, weights_imputed):
    changed = " ← replaced!" if orig != imp else ""
    print(f"{orig:>10.3f} {imp:>8.3f}{changed}")
Output:
Original Imputed
-------------------------
8.200 8.200
7.500 7.500
9.100 9.100
3247.000 8.100 ← replaced!
7.800 7.800
8.400 8.400
0.003 8.100 ← replaced!
8.000 8.000
Option 5: Separate Model for Outliers
When: Outliers are legitimate but behave differently.
# Sketch only: train_model and is_outlier are placeholders for your own code,
# and lower/upper are the IQR bounds computed earlier.
# Split data into normal and outlier segments
normal_mask = (df['weight'] >= lower) & (df['weight'] <= upper)
df_normal = df[normal_mask]
df_outliers = df[~normal_mask]
# Train separate models!
model_normal = train_model(df_normal)
model_outliers = train_model(df_outliers)
# At prediction time, route to appropriate model
def predict(row):
    if is_outlier(row):
        return model_outliers.predict(row)
    else:
        return model_normal.predict(row)
Option 6: Use Robust Algorithms
When: You want the model to handle outliers automatically.
Some algorithms are naturally resistant to outliers:
| Algorithm | Outlier Robust? | Why |
|---|---|---|
| Linear Regression | ❌ No | Minimizes squared error (outliers dominate) |
| RANSAC Regression | ✅ Yes | Ignores outliers during fitting |
| Huber Regression | ✅ Yes | Quadratic loss for small errors, linear for large |
| Decision Trees | ✅ Yes | Splits on thresholds, not affected by magnitude |
| Median-based stats | ✅ Yes | Median ignores extreme values |
import numpy as np
from sklearn.linear_model import HuberRegressor, RANSACRegressor, LinearRegression
# Compare on data with outliers
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)
y = np.array([2, 4, 6, 8, 10, 12, 14, 16, 18, 500]) # Last point is outlier!
# Standard Linear Regression (affected by outlier)
lr = LinearRegression().fit(X, y)
print(f"Linear Regression slope: {lr.coef_[0]:.2f}")
# Huber Regression (robust)
huber = HuberRegressor().fit(X, y)
print(f"Huber Regression slope: {huber.coef_[0]:.2f}")
# RANSAC Regression (very robust); random_state makes the random sampling repeatable
ransac = RANSACRegressor(random_state=42).fit(X, y)
print(f"RANSAC Regression slope: {ransac.estimator_.coef_[0]:.2f}")
Output:
Linear Regression slope: 28.18 ← Completely wrong! (should be ~2)
Huber Regression slope: 2.00 ← Correct!
RANSAC Regression slope: 2.00 ← Correct!
The outlier (y=500) destroyed Linear Regression but barely affected Huber and RANSAC.
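Why does Huber shrug off the outlier? A rough sketch of the textbook Huber loss next to squared loss shows it. Treat this as an illustration of the idea, not scikit-learn's exact objective (which also estimates a scale parameter); `delta=1.35` simply mirrors the default `epsilon` of `HuberRegressor`.

```python
import numpy as np

def squared_loss(residual):
    """Ordinary least-squares loss: grows with the square of the error."""
    return 0.5 * residual ** 2

def huber_loss(residual, delta=1.35):
    """Quadratic for |residual| <= delta, linear beyond it."""
    abs_r = np.abs(residual)
    return np.where(
        abs_r <= delta,
        0.5 * abs_r ** 2,
        delta * (abs_r - 0.5 * delta),
    )

residuals = np.array([0.5, 1.0, 5.0, 480.0])
for r, sq, hb in zip(residuals, squared_loss(residuals), huber_loss(residuals)):
    print(f"residual={r:>6.1f}  squared={sq:>10.2f}  huber={hb:>8.2f}")
```

For small residuals the two losses agree, but a residual of 480 costs over a hundred thousand under squared loss and only a few hundred under Huber, so one bad point can't dominate the fit.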
Option 7: Flag and Investigate
When: You're not sure if outliers are errors or insights.
import numpy as np
from scipy import stats

def flag_outliers(df, column, method='iqr'):
    """Add outlier flags without removing data."""
    if method == 'iqr':
        Q1 = df[column].quantile(0.25)
        Q3 = df[column].quantile(0.75)
        IQR = Q3 - Q1
        lower = Q1 - 1.5 * IQR
        upper = Q3 + 1.5 * IQR
        df[f'{column}_outlier'] = (df[column] < lower) | (df[column] > upper)
    elif method == 'zscore':
        z = np.abs(stats.zscore(df[column]))
        df[f'{column}_outlier'] = z > 3
    return df

# Flag without removing (df is your DataFrame with a 'weight' column)
df = flag_outliers(df, 'weight', method='iqr')
# Now you can investigate manually
print(df[df['weight_outlier']])
The Decision Framework
OUTLIER DETECTED
│
▼
Is it a data entry / measurement error?
├─ YES → Can you find the true value?
│        ├─ YES → FIX IT
│        └─ NO  → REMOVE IT
└─ NO (or unsure) → Is it a legitimate rare event?
         ├─ YES → KEEP! This might be valuable!
         └─ NO  → Does it break your model?
                  ├─ YES → Transform, cap/clip, or use a robust model
                  └─ NO  → Keep as-is
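The same framework can be written down as a tiny function, which is handy for documenting why each record was treated the way it was. This is only a sketch: the function name, its boolean arguments, and the returned labels are all my own choices, and the answers to each question still have to come from your investigation.

```python
def decide(is_error, true_value_known=False, is_legit_rare=False, breaks_model=False):
    """Walk the decision framework and return a suggested action."""
    if is_error:
        return "fix" if true_value_known else "remove"
    if is_legit_rare:
        return "keep"
    if breaks_model:
        return "transform, cap, or use a robust model"
    return "keep as-is"

# A typo whose true value was found in the paper records
print(decide(is_error=True, true_value_known=True))
# A broken-scale reading that cannot be re-measured
print(decide(is_error=True))
# A genuine billionaire in an income dataset
print(decide(is_error=False, is_legit_rare=True))
# Unexplained, not clearly legitimate, and it wrecks the regression
print(decide(is_error=False, breaks_model=True))
```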
Complete Code: The Outlier Handling Pipeline
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.ensemble import IsolationForest
class OutlierHandler:
    """Complete outlier detection and handling pipeline."""

    def __init__(self, method='iqr', threshold=1.5):
        self.method = method
        self.threshold = threshold
        self.bounds_ = {}

    def detect(self, df, columns):
        """Detect outliers in specified columns."""
        outlier_mask = pd.DataFrame(index=df.index)
        for col in columns:
            if self.method == 'iqr':
                Q1 = df[col].quantile(0.25)
                Q3 = df[col].quantile(0.75)
                IQR = Q3 - Q1
                lower = Q1 - self.threshold * IQR
                upper = Q3 + self.threshold * IQR
                self.bounds_[col] = (lower, upper)
                outlier_mask[col] = (df[col] < lower) | (df[col] > upper)
            elif self.method == 'zscore':
                z = np.abs(stats.zscore(df[col]))
                outlier_mask[col] = z > self.threshold
            elif self.method == 'isolation_forest':
                iso = IsolationForest(contamination=0.1, random_state=42)
                preds = iso.fit_predict(df[[col]])
                outlier_mask[col] = preds == -1
        return outlier_mask

    def remove(self, df, columns):
        """Remove rows with outliers."""
        mask = self.detect(df, columns)
        any_outlier = mask.any(axis=1)
        return df[~any_outlier].copy()

    def cap(self, df, columns):
        """Cap outliers to boundary values (IQR method only, since it needs bounds)."""
        df = df.copy()
        self.detect(df, columns)  # Calculate bounds
        for col in columns:
            lower, upper = self.bounds_[col]
            df[col] = df[col].clip(lower=lower, upper=upper)
        return df

    def impute_median(self, df, columns):
        """Replace outliers with median."""
        df = df.copy()
        mask = self.detect(df, columns)
        for col in columns:
            median = df[col].median()
            # Cast first so a fractional median fits into an integer column
            df[col] = df[col].astype(float)
            df.loc[mask[col], col] = median
        return df

    def flag(self, df, columns):
        """Add outlier flag columns."""
        df = df.copy()
        mask = self.detect(df, columns)
        for col in columns:
            df[f'{col}_is_outlier'] = mask[col]
        return df

# Usage example
np.random.seed(42)
df = pd.DataFrame({
    'weight': [8.2, 7.5, 9.1, 3247.0, 7.8, 8.4, 0.003, 8.0],
    'height': [45, 48, 52, 47, 46, 49, 44, 150],  # 150 is outlier
    'id': range(8)
})
print("=== Original Data ===")
print(df)
print()
handler = OutlierHandler(method='iqr', threshold=1.5)
# Detect
print("=== Outlier Detection ===")
outliers = handler.detect(df, ['weight', 'height'])
print(outliers)
print()
# Different handling strategies
print("=== Strategy 1: Remove ===")
df_removed = handler.remove(df, ['weight', 'height'])
print(f"Rows: {len(df)} → {len(df_removed)}")
print()
print("=== Strategy 2: Cap ===")
df_capped = handler.cap(df, ['weight', 'height'])
print(df_capped[['weight', 'height']])
print()
print("=== Strategy 3: Impute Median ===")
df_imputed = handler.impute_median(df, ['weight', 'height'])
print(df_imputed[['weight', 'height']])
print()
print("=== Strategy 4: Flag ===")
df_flagged = handler.flag(df, ['weight', 'height'])
print(df_flagged)
Output:
=== Original Data ===
weight height id
0 8.200 45 0
1 7.500 48 1
2 9.100 52 2
3 3247.000 47 3
4 7.800 46 4
5 8.400 49 5
6 0.003 44 6
7 8.000 150 7
=== Outlier Detection ===
weight height
0 False False
1 False False
2 False False
3 True False
4 False False
5 False False
6 True False
7 False True
=== Strategy 1: Remove ===
Rows: 8 → 5
=== Strategy 2: Cap ===
weight height
0 8.20 45.00
1 7.50 48.00
2 9.10 52.00
3 9.85 47.00 ← Capped!
4 7.80 46.00
5 8.40 49.00
6 6.45 44.00 ← Capped!
7 8.00 55.75 ← Capped!
=== Strategy 3: Impute Median ===
weight height
0 8.2 45.0
1 7.5 48.0
2 9.1 52.0
3 8.1 47.0 ← Replaced with median!
4 7.8 46.0
5 8.4 49.0
6 8.1 44.0 ← Replaced with median!
7 8.0 47.5 ← Replaced with median!
Common Mistakes
Mistake 1: Removing All Outliers Blindly
# ❌ WRONG: Delete everything beyond 3 std
df = df[np.abs(stats.zscore(df['value'])) < 3]
# You might be deleting valid rare events!
# ✅ RIGHT: Investigate first
outliers = df[np.abs(stats.zscore(df['value'])) >= 3]
print("Outliers found:")
print(outliers)
# Then decide case by case
Mistake 2: Using Z-Score on Skewed Data
# ❌ WRONG: Z-score on income data (heavily skewed)
z_scores = stats.zscore(income_data)
# Z-score assumes normal distribution!
# ✅ RIGHT: Use IQR or log-transform first
log_income = np.log1p(income_data)
z_scores = stats.zscore(log_income)
# Or just use IQR which doesn't assume normality
Mistake 3: Treating All Outliers the Same
# ❌ WRONG: One rule for all outliers
df = remove_all_outliers(df)
# ✅ RIGHT: Different strategies for different causes
df = investigate_and_handle(df, column='weight', reason='entry_error')
df = keep_but_flag(df, column='income', reason='legitimate_billionaire')
df = cap_values(df, column='age', reason='data_anonymization')
Mistake 4: Forgetting to Handle Outliers in Test Data
# ❌ WRONG: Handle outliers only in training
df_train = remove_outliers(df_train)
# Test data still has outliers!
# ✅ RIGHT: Consistent handling using training statistics
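# Note: fit() and transform() are not defined on the OutlierHandler class above;
# treat them as methods you would add so bounds are learned once and reused.
handler = OutlierHandler()
handler.fit(df_train) # Learn bounds from training
df_train_clean = handler.transform(df_train)
df_test_clean = handler.transform(df_test) # Apply same rules!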
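Here is one way such a fit/transform split could look. It's a minimal sketch, not part of the pipeline class above: `FittedIQRCapper` is a hypothetical helper, and the train/test frames are made up. The point is that the fences are learned from the training data only and then reused unchanged.

```python
import pandas as pd

class FittedIQRCapper:
    """Hypothetical helper: learns IQR fences on one frame, applies them to any other."""

    def __init__(self, columns, threshold=1.5):
        self.columns = columns
        self.threshold = threshold
        self.bounds_ = {}

    def fit(self, df):
        # Learn the fences from the data passed in (normally the training set)
        for col in self.columns:
            q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
            iqr = q3 - q1
            self.bounds_[col] = (q1 - self.threshold * iqr, q3 + self.threshold * iqr)
        return self

    def transform(self, df):
        # Apply the stored fences without recomputing anything
        out = df.copy()
        for col, (lower, upper) in self.bounds_.items():
            out[col] = out[col].clip(lower=lower, upper=upper)
        return out

# Made-up split: clean training weights, test weights containing the bad records
df_train = pd.DataFrame({"weight": [8.2, 7.5, 9.1, 7.8, 8.4, 8.0]})
df_test = pd.DataFrame({"weight": [8.1, 3247.0, 0.003]})

capper = FittedIQRCapper(columns=["weight"]).fit(df_train)
df_test_clean = capper.transform(df_test)
print(capper.bounds_)
print(df_test_clean)
```

Because `transform` never looks at the test set's own quartiles, a test set full of strange values can't move the fences.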
Quick Reference: Detection Methods
| Method | Robust to Outliers? | Best For | Threshold |
|---|---|---|---|
| Z-Score | ❌ No | Normal data, few outliers | \|Z\| > 3 |
| Modified Z-Score | ✅ Yes | General use | \|Modified Z\| > 3.5 |
| IQR | ✅ Yes | Any distribution | 1.5 × IQR |
| Isolation Forest | ✅ Yes | High dimensions | contamination param |
| DBSCAN | ✅ Yes | Clustered data | eps, min_samples |
| Visual (Box Plot) | N/A | Initial exploration | Human judgment |
Key Takeaways
Outliers aren't always errors — They might be your most valuable data
Investigate before acting — Is it an error, rare event, or different category?
IQR is more robust than Z-score — Z-score is corrupted by the very outliers it detects
Multiple handling strategies exist — Remove, cap, transform, impute, flag, or use robust models
Use domain knowledge — A 3,000 kg penguin is obviously wrong; a $5M salary might be real
Be consistent — Apply the same rules to train and test data
Document your decisions — Future you will thank present you
Visual inspection helps — Sometimes your eyes are the best detector
The One-Sentence Summary
The 3,000 kg penguin in your dataset is either a data entry error, a mislabeled elephant, or a discovery that will make you famous — your job is to figure out which before your model learns that all penguins are the size of cars.
What's Next?
Now that you understand outlier detection, you're ready for:
- Data Transformation — Log, Box-Cox, and power transforms
- Anomaly Detection Systems — Building production outlier detection
- Robust Statistics — Median, MAD, and trimmed means
- Data Quality Pipelines — Automated data validation
Follow me for the next article in this series!
Let's Connect!
If this saved you from trusting a 3,000 kg penguin, drop a heart!
Questions? Ask in the comments — I read and respond to every one.
What's the strangest outlier you've ever found? Share your stories!
The difference between a model that predicts penguin weights accurately and one that thinks penguins weigh as much as elephants? Knowing when that 3,247 kg data point is a typo vs. a scientific breakthrough. Investigate. Decide. Then act.
Share this with someone who's been deleting outliers without asking why. Their model (and their penguins) will thank you.
Happy detecting! 🐧