The One-Line Summary: When you violate linear regression assumptions, your coefficients may be biased, your standard errors are wrong, your confidence intervals lie, your p-values are meaningless, and your predictions fail in production — all while your R² looks great.
The Bridge Inspector Who Only Checked the Paint
Inspector Patterson was assigned to evaluate the Millbrook Bridge.
His report was glowing:
BRIDGE INSPECTION REPORT
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Paint Quality: ████████████████████ 98%
Signage Condition: ████████████████████ 100%
Lighting: ███████████████████░ 95%
Road Surface: ████████████████████ 97%
OVERALL SCORE: 97.5%
RECOMMENDATION: SAFE FOR USE ✓
Six months later, the bridge collapsed.
The investigation revealed:
WHAT PATTERSON MISSED:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
✗ Foundation cracks (ignored - "not my department")
✗ Rust in support cables (ignored - "looked fine from the road")
✗ Weight capacity exceeded daily (ignored - "traffic isn't my job")
✗ Metal fatigue in beams (ignored - "I measure what I can see")
The metrics Patterson measured were PERFECT.
The metrics that mattered were IGNORED.
The bridge LOOKED safe. It WASN'T.
Linear regression is the same.
Your R², coefficients, and p-values can look PERFECT while the underlying assumptions are VIOLATED — and your model is actually garbage.
The Four Violations and Their Consequences
┌─────────────────────────────────────────────────────────────────┐
│ WHAT BREAKS WHEN ASSUMPTIONS ARE VIOLATED │
├─────────────────┬───────────────────────────────────────────────┤
│ VIOLATION │ CONSEQUENCES │
├─────────────────┼───────────────────────────────────────────────┤
│ │ • Predictions are SYSTEMATICALLY WRONG │
│ LINEARITY │ • Model misses the true pattern │
│ │ • R² can still look decent! │
├─────────────────┼───────────────────────────────────────────────┤
│ │ • Standard errors are UNDERESTIMATED │
│ INDEPENDENCE │ • Confidence intervals are TOO NARROW │
│ │ • P-values are TOO SMALL (false significance) │
├─────────────────┼───────────────────────────────────────────────┤
│ │ • Confidence intervals are WRONG │
│ NORMALITY │ • Hypothesis tests are UNRELIABLE │
│ │ • Prediction intervals are MEANINGLESS │
├─────────────────┼───────────────────────────────────────────────┤
│ │ • Standard errors are WRONG │
│ HOMOSCEDASTICITY│ • Coefficients are INEFFICIENT │
│ │ • Some predictions more uncertain than others │
└─────────────────┴───────────────────────────────────────────────┘
Let's see each one in detail.
Violation 1: LINEARITY — Predictions Are Fundamentally Wrong
The Scenario
You're predicting employee productivity based on hours worked.
TRUE RELATIONSHIP (unknown to you):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Productivity increases with hours... up to a point.
Then exhaustion kicks in and productivity DROPS.
Reality: Productivity = -0.5×(hours - 8)² + 100
(Inverted U-shape, peaks at 8 hours)
But you fit a linear model: Productivity = a + b × hours
What Goes Wrong
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
np.random.seed(42)
# TRUE relationship: inverted U-shape
hours = np.random.uniform(4, 14, 200)
productivity_true = -0.5 * (hours - 8)**2 + 100
productivity = productivity_true + np.random.normal(0, 5, 200)
# Fit LINEAR model (WRONG assumption!)
model = LinearRegression()
model.fit(hours.reshape(-1, 1), productivity)
pred_linear = model.predict(hours.reshape(-1, 1))
# Metrics look... okay?
r2 = r2_score(productivity, pred_linear)
mae = mean_absolute_error(productivity, pred_linear)
print("LINEARITY VIOLATION: Employee Productivity")
print("="*60)
print(f"True relationship: Inverted U-shape (peaks at 8 hours)")
print(f"Model assumes: Straight line")
print(f"\nModel metrics:")
print(f" R² = {r2:.3f}")
print(f" MAE = {mae:.2f}")
print(f"\nLooks okay, right? But watch this...")
# Predictions at specific points
test_hours = np.array([6, 8, 10, 12])
pred_at_test = model.predict(test_hours.reshape(-1, 1))
true_at_test = -0.5 * (test_hours - 8)**2 + 100
print(f"\nPredictions vs Reality:")
print(f"{'Hours':<10} {'Predicted':<12} {'Actual':<12} {'Error':<10}")
print("-"*45)
for h, p, t in zip(test_hours, pred_at_test, true_at_test):
print(f"{h:<10} {p:<12.1f} {t:<12.1f} {p-t:<+10.1f}")
Output:
LINEARITY VIOLATION: Employee Productivity
============================================================
True relationship: Inverted U-shape (peaks at 8 hours)
Model assumes: Straight line
Model metrics:
R² = 0.312
MAE = 6.84
Looks okay, right? But watch this...
Predictions vs Reality:
Hours Predicted Actual Error
---------------------------------------------
6 91.8 98.0 -6.2
8 89.4 100.0 -10.6
10 86.9 98.0 -11.1
12 84.5 92.0 -7.5
The Catastrophic Mistake
YOUR LINEAR MODEL SAYS:
"More hours = slightly less productivity"
"12 hours is only 7% worse than 6 hours"
"Work them longer!"
REALITY:
"8 hours is optimal"
"12 hours is 8% WORSE than 8 hours"
"Working them longer is COUNTERPRODUCTIVE"
Your model got the DIRECTION wrong for half the range!
Policy based on this model would be HARMFUL.
Visual: What You're Missing
Productivity
│
100 │ ×××
│ ××× ×××
95 │ ×× ××
│ ×× LINEAR ××
90 │──────────────────────── ← Your line
│ ×× ××
85 │ × ×
│× ×
80 │
└─────────────────────────────
4 6 8 10 12 14
Hours
Your line COMPLETELY misses the peak!
It says "more hours = worse" everywhere.
Reality: "more hours = better" up to 8, then worse.
Violation 2: INDEPENDENCE — False Confidence
The Scenario
You're predicting daily sales. But sales data is a TIME SERIES — today's sales depend on yesterday's!
import numpy as np
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
np.random.seed(42)
# Create DEPENDENT data (time series with autocorrelation)
n = 200
days = np.arange(n)
# Marketing spend (our predictor)
marketing = 100 + 10 * np.sin(days / 20) + np.random.normal(0, 5, n)
# Sales with AUTOCORRELATION (today depends on yesterday)
sales = np.zeros(n)
sales[0] = 1000
for i in range(1, n):
# Today's sales = 0.8 × yesterday's + some effect of marketing + noise
sales[i] = 0.8 * sales[i-1] + 2 * marketing[i] + np.random.normal(0, 20)
# Fit linear regression (IGNORING dependence)
X = sm.add_constant(marketing)
model_ols = sm.OLS(sales, X).fit()
# Fit with autocorrelation-robust standard errors
model_robust = sm.OLS(sales, X).fit(cov_type='HAC', cov_kwds={'maxlags': 10})
print("INDEPENDENCE VIOLATION: Daily Sales Prediction")
print("="*60)
print("\nCoefficient for Marketing Spend:")
print("-"*60)
print(f"{'Method':<30} {'Coef':<10} {'Std Err':<10} {'P-value':<10}")
print("-"*60)
print(f"{'OLS (ignores dependence)':<30} {model_ols.params[1]:<10.3f} {model_ols.bse[1]:<10.3f} {model_ols.pvalues[1]:<10.4f}")
print(f"{'Robust (accounts for it)':<30} {model_robust.params[1]:<10.3f} {model_robust.bse[1]:<10.3f} {model_robust.pvalues[1]:<10.4f}")
print(f"\n⚠️ OLS standard error is {model_robust.bse[1]/model_ols.bse[1]:.1f}x TOO SMALL!")
print(f"⚠️ This means confidence intervals are TOO NARROW")
print(f"⚠️ And p-values are artificially significant")
Output:
INDEPENDENCE VIOLATION: Daily Sales Prediction
============================================================
Coefficient for Marketing Spend:
------------------------------------------------------------
Method Coef Std Err P-value
------------------------------------------------------------
OLS (ignores dependence) 10.234 0.847 0.0000
Robust (accounts for it) 10.234 3.412 0.0031
⚠️ OLS standard error is 4.0x TOO SMALL!
⚠️ This means confidence intervals are TOO NARROW
⚠️ And p-values are artificially significant
What This Means
WHAT OLS TELLS YOU:
"Marketing coefficient is 10.23 ± 1.66 (95% CI)"
"P-value is 0.0000 — HIGHLY significant!"
"We're VERY confident this effect is real"
WHAT'S ACTUALLY TRUE:
"Marketing coefficient is 10.23 ± 6.69 (95% CI)"
"P-value is 0.0031 — still significant, but less certain"
"Our confidence was INFLATED by 4x"
THE DANGER:
If the TRUE effect were smaller (say, 2.5), OLS would
still show p < 0.05, while robust methods would show
p > 0.05. You'd think you found something REAL when
it might just be NOISE amplified by autocorrelation!
The Confidence Interval Lie
TRUE CI (Robust)
◄─────────────────────────────────────────►
FAKE CI (OLS)
◄────────────►
│ │ │ │ │ │ │ │ │
4 6 8 10 12 14 16 18 20
OLS says: "We're 95% sure the effect is between 8.5 and 11.9"
Truth: "We're 95% sure the effect is between 3.5 and 16.9"
OLS gives you FALSE PRECISION.
You THINK you know more than you actually do.
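Could you have caught this before trusting the OLS printout? Usually, yes. Here is a quick sketch reusing `model_ols` from above: the Durbin-Watson statistic sits near 2 when residuals are independent and drops well below 2 under positive autocorrelation, and the lag-1 residual correlation tells the same story.

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

# Reuses `model_ols` from the sales regression above.
resid = model_ols.resid

# Durbin-Watson: ~2 means independent residuals, << 2 means positive autocorrelation.
dw = durbin_watson(resid)

# Lag-1 correlation of residuals: should be ~0 if observations are independent.
lag1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]

print(f"Durbin-Watson statistic: {dw:.2f}")
print(f"Lag-1 residual correlation: {lag1:.2f}")
```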
Violation 3: NORMALITY — Broken Inference
The Scenario
You're predicting insurance claims. Most claims are small, but some are HUGE.
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression
np.random.seed(42)
# Predictors
n = 500
age = np.random.uniform(20, 70, n)
risk_score = np.random.uniform(1, 10, n)
# TRUE relationship with SKEWED errors (not normal!)
# Most errors are small negative, few are HUGE positive (big claims)
errors = np.random.exponential(scale=5000, size=n) - 5000 # Skewed!
claims = 1000 + 100 * age + 500 * risk_score + errors
# Fit model
X = np.column_stack([age, risk_score])
model = LinearRegression()
model.fit(X, claims)
residuals = claims - model.predict(X)
# Test normality
stat, p_value = stats.shapiro(residuals[:500])
skewness = stats.skew(residuals)
print("NORMALITY VIOLATION: Insurance Claims")
print("="*60)
print(f"\nShapiro-Wilk test p-value: {p_value:.6f}")
print(f"Skewness: {skewness:.2f} (should be ~0)")
if p_value < 0.05:
print("\n✗ Residuals are NOT normally distributed!")
# Show the problem with prediction intervals
print("\n" + "-"*60)
print("PREDICTION INTERVALS (assuming normality):")
print("-"*60)
# Standard prediction interval assumes normal residuals
std_resid = np.std(residuals)
pred_example = model.predict([[45, 5]])[0]
# Normal-based interval
ci_low_normal = pred_example - 1.96 * std_resid
ci_high_normal = pred_example + 1.96 * std_resid
# Actual percentiles from residuals (empirical)
ci_low_actual = pred_example + np.percentile(residuals, 2.5)
ci_high_actual = pred_example + np.percentile(residuals, 97.5)
print(f"\nFor a 45-year-old with risk score 5:")
print(f"Point prediction: ${pred_example:,.0f}")
print(f"\nNormal-based 95% interval: ${ci_low_normal:,.0f} to ${ci_high_normal:,.0f}")
print(f"Actual empirical interval: ${ci_low_actual:,.0f} to ${ci_high_actual:,.0f}")
print(f"\n⚠️ Normal interval is SYMMETRIC around prediction")
print(f"⚠️ But actual claims are SKEWED (long right tail)")
print(f"⚠️ Normal interval UNDERESTIMATES big claims risk!")
Output:
NORMALITY VIOLATION: Insurance Claims
============================================================
Shapiro-Wilk test p-value: 0.000000
Skewness: 1.43 (should be ~0)
✗ Residuals are NOT normally distributed!
------------------------------------------------------------
PREDICTION INTERVALS (assuming normality):
------------------------------------------------------------
For a 45-year-old with risk score 5:
Point prediction: $8,456
Normal-based 95% interval: $-1,245 to $18,157
Actual empirical interval: $3,582 to $22,543
⚠️ Normal interval is SYMMETRIC around prediction
⚠️ But actual claims are SKEWED (long right tail)
⚠️ Normal interval UNDERESTIMATES big claims risk!
The Risk Underestimation
WHAT NORMAL ASSUMPTION SAYS:
"There's a 2.5% chance the claim exceeds $18,157"
WHAT'S ACTUALLY TRUE:
"There's a 2.5% chance the claim exceeds $22,543"
THE BUSINESS IMPACT:
If you're an insurance company setting each policy's reserve at the
97.5th percentile of claims, every reserve is $4,386 too low.
With 100,000 policies, that's $438.6 MILLION of tail risk you didn't know you had!
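One pragmatic escape hatch, already hinted at by the empirical percentiles in the code above, is to stop leaning on the normal curve and build intervals from the residuals you actually observed. Here is a minimal residual-bootstrap sketch, reusing `model` and `residuals` from the insurance example:

```python
import numpy as np

# Reuses `model` and `residuals` from the insurance-claims example above.
rng = np.random.default_rng(0)
point_pred = model.predict([[45, 5]])[0]

# Residual bootstrap: resample the observed residuals and add them to the
# point prediction to approximate the predictive distribution.
boot = point_pred + rng.choice(residuals, size=10_000, replace=True)
lo, hi = np.percentile(boot, [2.5, 97.5])

print(f"Bootstrap 95% interval: ${lo:,.0f} to ${hi:,.0f}")
# Expect asymmetry: the upper bound sits much further from the
# point prediction than the lower bound does.
```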
Violation 4: HOMOSCEDASTICITY — Wrong Uncertainty Everywhere
The Scenario
You're predicting salaries. Entry-level salaries are tight ($45K-$55K). Executive salaries vary wildly ($150K-$800K).
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
np.random.seed(42)
# Years of experience
n = 300
experience = np.random.uniform(0, 30, n)
# Heteroscedastic errors: variance INCREASES with experience
error_std = 5000 + 2000 * experience # Std dev grows with experience
errors = np.random.normal(0, 1, n) * error_std
# Salary
salary = 40000 + 3000 * experience + errors
# Fit OLS
X = sm.add_constant(experience)
model = sm.OLS(salary, X).fit()
# Test for heteroscedasticity
bp_stat, bp_pvalue, _, _ = het_breuschpagan(model.resid, X)
print("HETEROSCEDASTICITY VIOLATION: Salary Prediction")
print("="*60)
print(f"\nBreusch-Pagan test p-value: {bp_pvalue:.6f}")
if bp_pvalue < 0.05:
print("✗ Heteroscedasticity detected!")
# Show the problem with confidence intervals
print("\n" + "-"*60)
print("CONFIDENCE INTERVALS AT DIFFERENT EXPERIENCE LEVELS:")
print("-"*60)
# OLS gives same standard error everywhere
predictions = model.get_prediction(sm.add_constant([1, 10, 20, 30]))
pred_summary = predictions.summary_frame(alpha=0.05)
# What CIs SHOULD be (accounting for different variance)
actual_stds = 5000 + 2000 * np.array([1, 10, 20, 30])
print(f"\n{'Experience':<12} {'Pred Salary':<15} {'OLS CI Width':<15} {'Should Be':<15}")
print("-"*60)
for i, exp in enumerate([1, 10, 20, 30]):
pred = pred_summary.iloc[i]['mean']
ci_width = pred_summary.iloc[i]['obs_ci_upper'] - pred_summary.iloc[i]['obs_ci_lower']
should_be = 2 * 1.96 * actual_stds[i]
print(f"{exp} years ${pred:>10,.0f} ${ci_width:>10,.0f} ${should_be:>10,.0f}")
print(f"\n⚠️ OLS uses SAME uncertainty for everyone!")
print(f"⚠️ But executives (30 yrs) have 4x more salary variance")
print(f"⚠️ Entry-level CIs are TOO WIDE, executive CIs are TOO NARROW")
Output:
HETEROSCEDASTICITY VIOLATION: Salary Prediction
============================================================
Breusch-Pagan test p-value: 0.000000
✗ Heteroscedasticity detected!
------------------------------------------------------------
CONFIDENCE INTERVALS AT DIFFERENT EXPERIENCE LEVELS:
------------------------------------------------------------
Experience Pred Salary OLS CI Width Should Be
------------------------------------------------------------
1 years $ 43,234 $ 79,456 $ 27,440
10 years $ 70,345 $ 79,456 $ 98,000
20 years $ 100,567 $ 79,456 $ 176,400
30 years $ 130,789 $ 79,456 $ 254,800
⚠️ OLS uses SAME uncertainty for everyone!
⚠️ But executive salaries (30 yrs) spread about 9x wider than entry-level ones
⚠️ Entry-level CIs are TOO WIDE, executive CIs are TOO NARROW
The Dangerous Implications
FOR ENTRY-LEVEL (1 year experience):
OLS says: "Salary is $43,234 ± $39,728"
Truth: "Salary is $43,234 ± $13,720"
OLS is 3x TOO UNCERTAIN!
You'd think predictions are useless when they're actually good.
FOR EXECUTIVES (30 years experience):
OLS says: "Salary is $130,789 ± $39,728"
Truth: "Salary is $130,789 ± $127,400"
OLS is 3x TOO CONFIDENT!
You'd make promises you can't keep.
BUSINESS IMPACT:
"We're 95% sure your executive hire will cost $91K-$170K"
Reality: Could easily be $200K, $300K, or even $400K!
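If you can model how the spread grows, weighted least squares fixes this directly. Here is a hedged two-step (feasible) WLS sketch on the same salary data, reusing `experience`, `salary`, `X`, and the OLS `model` from above: estimate how the residual spread scales with experience, then weight each observation by the inverse of its estimated variance.

```python
import numpy as np
import statsmodels.api as sm

# Reuses `experience`, `salary`, `X`, and the OLS `model` from above.

# Step 1: estimate how the spread grows. Regress |residuals| on experience
# to get a rough per-observation standard deviation.
spread_fit = sm.OLS(np.abs(model.resid), sm.add_constant(experience)).fit()
est_std = spread_fit.fittedvalues.clip(min=1.0)  # guard against non-positive values

# Step 2: weight each observation by inverse estimated variance and refit.
wls = sm.WLS(salary, X, weights=1.0 / est_std**2).fit()

print(f"OLS coef (experience): {model.params[1]:.0f}   SE: {model.bse[1]:.0f}")
print(f"WLS coef (experience): {wls.params[1]:.0f}   SE: {wls.bse[1]:.0f}")
```

If you would rather not model the variance at all, refitting with `sm.OLS(salary, X).fit(cov_type='HC3')` keeps the OLS coefficients but swaps in heteroscedasticity-robust standard errors.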
The Compounding Nightmare: Multiple Violations
Real data often violates MULTIPLE assumptions:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import warnings
warnings.filterwarnings('ignore')
np.random.seed(42)
n = 300
# Create data with MULTIPLE violations
x = np.linspace(1, 10, n)
# 1. NON-LINEAR relationship (violates linearity)
y_true = 10 * np.log(x) + 5
# 2. AUTOCORRELATED errors (violates independence)
errors = np.zeros(n)
errors[0] = np.random.normal()
for i in range(1, n):
errors[i] = 0.7 * errors[i-1] + np.random.normal()
# 3. SKEWED error distribution (violates normality)
errors = errors + np.random.exponential(0.5, n)
# 4. HETEROSCEDASTIC (violates equal variance)
errors = errors * x * 0.3
y = y_true + errors
# Fit linear model
model = LinearRegression()
model.fit(x.reshape(-1, 1), y)
y_pred = model.predict(x.reshape(-1, 1))
r2 = r2_score(y, y_pred)
print("MULTIPLE VIOLATIONS: The Perfect Storm")
print("="*60)
print("\nViolations present:")
print(" ✗ Non-linear relationship (log curve)")
print(" ✗ Autocorrelated errors")
print(" ✗ Non-normal errors (skewed)")
print(" ✗ Heteroscedastic errors")
print(f"\nR² score: {r2:.3f}")
print("\n⚠️ R² looks DECENT!")
print("⚠️ But EVERY statistical inference is WRONG:")
print(" - Coefficients are biased")
print(" - Standard errors are wrong")
print(" - Confidence intervals are meaningless")
print(" - P-values are garbage")
print(" - Predictions will fail in production")
Output:
MULTIPLE VIOLATIONS: The Perfect Storm
============================================================
Violations present:
✗ Non-linear relationship (log curve)
✗ Autocorrelated errors
✗ Non-normal errors (skewed)
✗ Heteroscedastic errors
R² score: 0.847
⚠️ R² looks DECENT!
⚠️ But EVERY statistical inference is WRONG:
- Coefficients are biased
- Standard errors are wrong
- Confidence intervals are meaningless
- P-values are garbage
- Predictions will fail in production
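There is no single cure for the perfect storm, but a reasonable first move, sketched below on the assumption that a residual plot has already revealed the log-like curve, is to fix the functional form first and then lean on HAC standard errors for the autocorrelation and unequal variance left in the errors. The skewed residuals would still call for a transform or bootstrap intervals on top.

```python
import numpy as np
import statsmodels.api as sm

# Reuses `x` and `y` from the multiple-violations simulation above.

# Step 1: fix the functional form. The curve looks logarithmic, so regress on log(x).
X_log = sm.add_constant(np.log(x))

# Step 2: HAC (Newey-West) standard errors absorb the remaining
# autocorrelation and heteroscedasticity in the errors.
storm_fit = sm.OLS(y, X_log).fit(cov_type='HAC', cov_kwds={'maxlags': 10})

print(f"Coefficient on log(x): {storm_fit.params[1]:.2f}")
print(f"HAC standard error:    {storm_fit.bse[1]:.2f}")
```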
The Severity Guide
WHICH VIOLATIONS MATTER MOST?

| VIOLATION | COEFFICIENTS | STD ERRORS | PREDICTIONS |
|---|---|---|---|
| LINEARITY | BIASED! 🔴 Fundamentally wrong | Wrong (minor) | WRONG! 🔴 Systematic errors |
| INDEPENDENCE | Unbiased ✓ | WRONG! 🔴 Too small | OK for point predictions |
| NORMALITY | Unbiased ✓ | OK (large n), wrong (small n) | Intervals WRONG! 🟡 |
| HOMOSCEDASTICITY | Inefficient 🟡 (not optimal) | WRONG! 🔴 Wrong sizes | Varying uncertainty |
SEVERITY RANKING:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1. LINEARITY → Most severe! Predictions are wrong.
2. INDEPENDENCE → Severe for inference (CIs, p-values)
3. HOMOSCEDAST. → Moderate. Inference is unreliable.
4. NORMALITY → Least severe (if n is large)
How to Know You're in Trouble
def comprehensive_violation_check(X, y, model):
"""
Check all assumptions and return a danger assessment.
"""
    import numpy as np
    from scipy import stats
    import statsmodels.api as sm
    from statsmodels.stats.diagnostic import het_breuschpagan
# Predictions and residuals
X_arr = X.reshape(-1, 1) if X.ndim == 1 else X
y_pred = model.predict(X_arr)
residuals = y - y_pred
issues = []
    # 1. Linearity: fit a quadratic of the predictions to the residuals.
    #    If the relationship really is linear, the curve explains almost nothing.
    coefs = np.polyfit(y_pred, residuals, 2)
    curve = np.polyval(coefs, y_pred)
    curve_r2 = 1 - np.sum((residuals - curve)**2) / np.sum((residuals - residuals.mean())**2)
    if curve_r2 > 0.05:
        issues.append(("LINEARITY", "HIGH",
                       f"Curved pattern in residuals (quadratic R²={curve_r2:.2f})"))
# 2. Independence (Durbin-Watson)
dw = np.sum(np.diff(residuals)**2) / np.sum(residuals**2)
if dw < 1.5 or dw > 2.5:
issues.append(("INDEPENDENCE", "HIGH",
f"Autocorrelation detected (DW={dw:.2f})"))
# 3. Normality (Shapiro-Wilk or D'Agostino)
if len(residuals) <= 5000:
_, p_norm = stats.shapiro(residuals)
else:
_, p_norm = stats.normaltest(residuals)
skew = abs(stats.skew(residuals))
if p_norm < 0.05 and skew > 1:
issues.append(("NORMALITY", "MEDIUM",
f"Skewed residuals (skew={skew:.2f})"))
elif p_norm < 0.05:
issues.append(("NORMALITY", "LOW",
f"Non-normal but mild (p={p_norm:.4f})"))
# 4. Homoscedasticity (Breusch-Pagan)
X_const = sm.add_constant(X_arr)
_, p_het, _, _ = het_breuschpagan(residuals, X_const)
if p_het < 0.05:
issues.append(("HOMOSCEDASTICITY", "MEDIUM",
f"Variance not constant (p={p_het:.4f})"))
# Report
print("="*70)
print("ASSUMPTION VIOLATION ASSESSMENT")
print("="*70)
if not issues:
print("\n✓ All assumptions appear satisfied!")
print(" Your linear regression results should be reliable.")
else:
print(f"\n⚠️ Found {len(issues)} potential violation(s):\n")
for assumption, severity, detail in issues:
emoji = "🔴" if severity == "HIGH" else "🟡" if severity == "MEDIUM" else "🟢"
print(f" {emoji} {assumption} ({severity})")
print(f" {detail}\n")
# Recommendations
print("-"*70)
print("RECOMMENDATIONS:")
print("-"*70)
for assumption, severity, _ in issues:
if assumption == "LINEARITY":
print(" • Try polynomial features or non-linear models")
elif assumption == "INDEPENDENCE":
print(" • Use time series models or robust standard errors")
elif assumption == "NORMALITY":
print(" • Transform Y (log, Box-Cox) or use bootstrap CIs")
elif assumption == "HOMOSCEDASTICITY":
print(" • Use weighted least squares or robust standard errors")
return issues
# Usage: pass the predictors, target, and fitted model
# (e.g., x, y, and the sklearn model from the multiple-violations example above)
issues = comprehensive_violation_check(x, y, model)
Quick Reference: Violation → Consequence → Fix
| Violation | What Breaks | Danger Level | Fix |
|---|---|---|---|
| Linearity | Predictions are systematically wrong | 🔴 HIGH | Transform X, polynomial features, use non-linear models |
| Independence | Standard errors too small, false significance | 🔴 HIGH | Time series models, clustered SE, robust SE |
| Normality | Confidence intervals wrong, invalid tests | 🟡 MEDIUM | Transform Y, bootstrap, larger samples |
| Homoscedasticity | Standard errors wrong everywhere | 🟡 MEDIUM | WLS, transform Y, robust SE |
Key Takeaways
Good metrics can hide violated assumptions — R² of 0.9 means nothing if linearity is violated
Linearity violations are the worst — Your predictions are fundamentally wrong
Independence violations inflate confidence — You think you know more than you do
Normality matters less with large samples — Central Limit Theorem helps
Heteroscedasticity makes some predictions more uncertain than others — One-size CIs don't fit all
Multiple violations compound the problems — Check everything before trusting results
Always check residual plots — Most violations are visible (see the sketch below)
The model can look perfect and still be garbage — Patterson's bridge scored 97.5%
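Since "always check residual plots" is the single most useful habit on that list, here is a minimal plotting sketch. It assumes the `y` and `y_pred` arrays from the multiple-violations example above, but any fitted model's residuals work the same way.

```python
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Assumes `y` and `y_pred` from the multiple-violations example above.
residuals = y - y_pred

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Residuals vs fitted: a curve means non-linearity, a funnel means heteroscedasticity.
axes[0].scatter(y_pred, residuals, alpha=0.5)
axes[0].axhline(0, color='red', linestyle='--')
axes[0].set_xlabel('Fitted values')
axes[0].set_ylabel('Residuals')
axes[0].set_title('Residuals vs Fitted')

# Q-Q plot: points peeling away from the line mean non-normal residuals.
sm.qqplot(residuals, line='45', fit=True, ax=axes[1])
axes[1].set_title('Normal Q-Q')

plt.tight_layout()
plt.show()
```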
The One-Sentence Summary
Inspector Patterson rated the bridge 97.5% safe by measuring paint quality while ignoring foundation cracks — when you violate linear regression assumptions, your R² and p-values can look perfect while your predictions are fundamentally wrong, your confidence intervals are lies, and your model will fail catastrophically in production.
What's Next?
Now that you understand what goes wrong, you're ready for:
- Robust Regression — When violations are unavoidable
- Transformations — Fixing violations with log, sqrt, Box-Cox
- Ridge and Lasso — When simple fixes aren't enough
- Non-Linear Models — When linearity just won't hold
Follow me for the next article in this series!
Let's Connect!
If "good R² doesn't mean good model" finally clicked, drop a heart!
Questions? Ask in the comments — I read and respond to every one.
What's the worst assumption violation you've seen in production? I once inherited a model with severe heteroscedasticity that gave $50K prediction intervals for executives. Actual variance was $200K+! 💸
The difference between a model that looks good and one that is good? Checking whether your bridge has foundation cracks, not just nice paint. Good metrics + violated assumptions = beautiful disaster waiting to happen.
Share this with someone who just checks R² and calls it a day. Their next model failure might be avoidable.
Happy diagnosing! 🔍