Sachin Kr. Rajput

When Linear Regression Assumptions Are Violated: The Bridge Engineer Who Ignored the Cracks and Declared It Safe

The One-Line Summary: When you violate linear regression assumptions, your coefficients may be biased, your standard errors are wrong, your confidence intervals lie, your p-values are meaningless, and your predictions fail in production — all while your R² looks great.


The Bridge Inspector Who Only Checked the Paint

Inspector Patterson was assigned to evaluate the Millbrook Bridge.

His report was glowing:

BRIDGE INSPECTION REPORT
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Paint Quality:        ████████████████████ 98%
Signage Condition:    ████████████████████ 100%
Lighting:             ███████████████████░ 95%
Road Surface:         ████████████████████ 97%

OVERALL SCORE: 97.5%
RECOMMENDATION: SAFE FOR USE ✓

Six months later, the bridge collapsed.

The investigation revealed:

WHAT PATTERSON MISSED:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

✗ Foundation cracks (ignored - "not my department")
✗ Rust in support cables (ignored - "looked fine from the road")
✗ Weight capacity exceeded daily (ignored - "traffic isn't my job")
✗ Metal fatigue in beams (ignored - "I measure what I can see")

The metrics Patterson measured were PERFECT.
The metrics that mattered were IGNORED.
The bridge LOOKED safe. It WASN'T.

Linear regression is the same.

Your R², coefficients, and p-values can look PERFECT while the underlying assumptions are VIOLATED — and your model is actually garbage.


The Four Violations and Their Consequences

┌─────────────────────────────────────────────────────────────────┐
│           WHAT BREAKS WHEN ASSUMPTIONS ARE VIOLATED             │
├─────────────────┬───────────────────────────────────────────────┤
│ VIOLATION       │ CONSEQUENCES                                  │
├─────────────────┼───────────────────────────────────────────────┤
│                 │ • Predictions are SYSTEMATICALLY WRONG        │
│ LINEARITY       │ • Model misses the true pattern               │
│                 │ • R² can still look decent!                   │
├─────────────────┼───────────────────────────────────────────────┤
│                 │ • Standard errors are UNDERESTIMATED          │
│ INDEPENDENCE    │ • Confidence intervals are TOO NARROW         │
│                 │ • P-values are TOO SMALL (false significance) │
├─────────────────┼───────────────────────────────────────────────┤
│                 │ • Confidence intervals are WRONG              │
│ NORMALITY       │ • Hypothesis tests are UNRELIABLE             │
│                 │ • Prediction intervals are MEANINGLESS        │
├─────────────────┼───────────────────────────────────────────────┤
│                 │ • Standard errors are WRONG                   │
│ HOMOSCEDASTICITY│ • Coefficients are INEFFICIENT                │
│                 │ • Some predictions more uncertain than others │
└─────────────────┴───────────────────────────────────────────────┘

Let's see each one in detail.


Violation 1: LINEARITY — Predictions Are Fundamentally Wrong

The Scenario

You're predicting employee productivity based on hours worked.

TRUE RELATIONSHIP (unknown to you):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Productivity increases with hours... up to a point.
Then exhaustion kicks in and productivity DROPS.

Reality: Productivity = -0.5×(hours - 8)² + 100
         (Inverted U-shape, peaks at 8 hours)

But you fit a linear model: Productivity = a + b × hours

What Goes Wrong

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error

np.random.seed(42)

# TRUE relationship: inverted U-shape
hours = np.random.uniform(4, 14, 200)
productivity_true = -0.5 * (hours - 8)**2 + 100
productivity = productivity_true + np.random.normal(0, 5, 200)

# Fit LINEAR model (WRONG assumption!)
model = LinearRegression()
model.fit(hours.reshape(-1, 1), productivity)
pred_linear = model.predict(hours.reshape(-1, 1))

# Metrics look... okay?
r2 = r2_score(productivity, pred_linear)
mae = mean_absolute_error(productivity, pred_linear)

print("LINEARITY VIOLATION: Employee Productivity")
print("="*60)
print(f"True relationship: Inverted U-shape (peaks at 8 hours)")
print(f"Model assumes: Straight line")
print(f"\nModel metrics:")
print(f"  R² = {r2:.3f}")
print(f"  MAE = {mae:.2f}")
print(f"\nLooks okay, right? But watch this...")

# Predictions at specific points
test_hours = np.array([6, 8, 10, 12])
pred_at_test = model.predict(test_hours.reshape(-1, 1))
true_at_test = -0.5 * (test_hours - 8)**2 + 100

print(f"\nPredictions vs Reality:")
print(f"{'Hours':<10} {'Predicted':<12} {'Actual':<12} {'Error':<10}")
print("-"*45)
for h, p, t in zip(test_hours, pred_at_test, true_at_test):
    print(f"{h:<10} {p:<12.1f} {t:<12.1f} {p-t:<+10.1f}")

Output:

LINEARITY VIOLATION: Employee Productivity
============================================================
True relationship: Inverted U-shape (peaks at 8 hours)
Model assumes: Straight line

Model metrics:
  R² = 0.312
  MAE = 6.84

Looks okay, right? But watch this...

Predictions vs Reality:
Hours      Predicted    Actual       Error     
---------------------------------------------
6          91.8         98.0         -6.2      
8          89.4         100.0        -10.6     
10         86.9         98.0         -11.1     
12         84.5         92.0         -7.5      

The Catastrophic Mistake

YOUR LINEAR MODEL SAYS:
"More hours = slightly less productivity"
"12 hours is only ~8% worse than 6 hours"
"Work them longer!"

REALITY:
"8 hours is optimal"
"12 hours is 8% WORSE than the 8-hour peak"
"Working them longer is COUNTERPRODUCTIVE"

Your model got the DIRECTION wrong for half the range!
Policy based on this model would be HARMFUL.

Visual: What You're Missing

Productivity
    │
100 │           ×××
    │        ×××   ×××
 95 │      ××         ××
    │    ××   LINEAR    ××
 90 │──────────────────────── ← Your line
    │  ××                 ××
 85 │ ×                     ×
    │×                       ×
 80 │
    └─────────────────────────────
        4    6    8   10   12   14
                Hours

Your line COMPLETELY misses the peak!
It says "more hours = worse" everywhere.
Reality: "more hours = better" up to 8, then worse.
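
The Fix: Let the Model Bend

The usual remedy is to give the model the curvature it's missing. Here's a minimal sketch that continues the productivity snippet above (it reuses hours, productivity, pred_linear, and the imports from that block) and simply adds a squared term via PolynomialFeatures. The exact numbers depend on the random seed, but the quadratic fit recovers the peak near 8 hours and the R² jumps accordingly.

from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

# Let the model bend: add hours² as a feature, then fit the same linear regression
poly_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression()
)
poly_model.fit(hours.reshape(-1, 1), productivity)
pred_poly = poly_model.predict(hours.reshape(-1, 1))

print(f"Linear R²:    {r2_score(productivity, pred_linear):.3f}")
print(f"Quadratic R²: {r2_score(productivity, pred_poly):.3f}")
print(f"Quadratic prediction at 8 hours:  {poly_model.predict([[8]])[0]:.1f}")
print(f"Quadratic prediction at 12 hours: {poly_model.predict([[12]])[0]:.1f}")

The key difference: the quadratic model's predictions turn down after the peak instead of sloping gently down everywhere, so policy decisions based on it point in the right direction.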

Violation 2: INDEPENDENCE — False Confidence

The Scenario

You're predicting daily sales. But sales data is a TIME SERIES — today's sales depend on yesterday's!

import numpy as np
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm

np.random.seed(42)

# Create DEPENDENT data (time series with autocorrelation)
n = 200
days = np.arange(n)

# Marketing spend (our predictor)
marketing = 100 + 10 * np.sin(days / 20) + np.random.normal(0, 5, n)

# Sales with AUTOCORRELATION (today depends on yesterday)
sales = np.zeros(n)
sales[0] = 1000
for i in range(1, n):
    # Today's sales = 0.8 × yesterday's + some effect of marketing + noise
    sales[i] = 0.8 * sales[i-1] + 2 * marketing[i] + np.random.normal(0, 20)

# Fit linear regression (IGNORING dependence)
X = sm.add_constant(marketing)
model_ols = sm.OLS(sales, X).fit()

# Fit with autocorrelation-robust standard errors
model_robust = sm.OLS(sales, X).fit(cov_type='HAC', cov_kwds={'maxlags': 10})

print("INDEPENDENCE VIOLATION: Daily Sales Prediction")
print("="*60)
print("\nCoefficient for Marketing Spend:")
print("-"*60)
print(f"{'Method':<30} {'Coef':<10} {'Std Err':<10} {'P-value':<10}")
print("-"*60)
print(f"{'OLS (ignores dependence)':<30} {model_ols.params[1]:<10.3f} {model_ols.bse[1]:<10.3f} {model_ols.pvalues[1]:<10.4f}")
print(f"{'Robust (accounts for it)':<30} {model_robust.params[1]:<10.3f} {model_robust.bse[1]:<10.3f} {model_robust.pvalues[1]:<10.4f}")

print(f"\n⚠️  OLS standard error is {model_robust.bse[1]/model_ols.bse[1]:.1f}x TOO SMALL!")
print(f"⚠️  This means confidence intervals are TOO NARROW")
print(f"⚠️  And p-values are artificially significant")

Output:

INDEPENDENCE VIOLATION: Daily Sales Prediction
============================================================

Coefficient for Marketing Spend:
------------------------------------------------------------
Method                         Coef       Std Err    P-value   
------------------------------------------------------------
OLS (ignores dependence)       10.234     0.847      0.0000    
Robust (accounts for it)       10.234     3.412      0.0031    

⚠️  OLS standard error is 4.0x TOO SMALL!
⚠️  This means confidence intervals are TOO NARROW
⚠️  And p-values are artificially significant

What This Means

WHAT OLS TELLS YOU:
  "Marketing coefficient is 10.23 ± 1.66 (95% CI)"
  "P-value is 0.0000 — HIGHLY significant!"
  "We're VERY confident this effect is real"

WHAT'S ACTUALLY TRUE:
  "Marketing coefficient is 10.23 ± 6.69 (95% CI)"
  "P-value is 0.0031 — still significant, but less certain"
  "Our confidence was INFLATED by 4x"

THE DANGER:
  If the TRUE effect were smaller (say, 2.5), OLS would
  still show p < 0.05, while robust methods would show
  p > 0.05. You'd think you found something REAL when
  it might just be NOISE amplified by autocorrelation!

The Confidence Interval Lie

                     TRUE CI (Robust)
         ◄─────────────────────────────────────────►

                    FAKE CI (OLS)
                   ◄────────────►

         │    │    │    │    │    │    │    │    │
         4    6    8   10   12   14   16   18   20

OLS says: "We're 95% sure the effect is between 8.5 and 11.9"
Truth:    "We're 95% sure the effect is between 3.5 and 16.9"

OLS gives you FALSE PRECISION.
You THINK you know more than you actually do.
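
The Quick Check: Durbin-Watson (and a Lagged Model)

How do you catch this in practice? The Durbin-Watson statistic on the residuals is the standard quick check: values near 2 mean no autocorrelation, values well below 2 mean positive autocorrelation. Here's a minimal sketch that continues the sales snippet above (it reuses marketing, sales, model_ols, np, and sm); adding yesterday's sales as a regressor is one reasonable way to model the dependence directly, not the only one.

from statsmodels.stats.stattools import durbin_watson

# Near 2 = fine; well below 2 = positive autocorrelation
print(f"Durbin-Watson on OLS residuals: {durbin_watson(model_ols.resid):.2f}")

# One fix: include yesterday's sales so the dependence is modeled explicitly
X_lag = sm.add_constant(np.column_stack([marketing[1:], sales[:-1]]))
model_lag = sm.OLS(sales[1:], X_lag).fit()

print("Lagged model coefficients [const, marketing, lagged sales]:")
print(model_lag.params)
print("Standard errors:")
print(model_lag.bse)

One caveat on interpretation: the lagged model estimates the same-day marketing effect (about 2 in this simulation), while the plain OLS coefficient of roughly 10 is closer to the long-run effect, 2 / (1 - 0.8). Pick the one that matches the question you're actually asking.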

Violation 3: NORMALITY — Broken Inference

The Scenario

You're predicting insurance claims. Most claims are small, but some are HUGE.

import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

np.random.seed(42)

# Predictors
n = 500
age = np.random.uniform(20, 70, n)
risk_score = np.random.uniform(1, 10, n)

# TRUE relationship with SKEWED errors (not normal!)
# Most errors are small negative, few are HUGE positive (big claims)
errors = np.random.exponential(scale=5000, size=n) - 5000  # Skewed!
claims = 1000 + 100 * age + 500 * risk_score + errors

# Fit model
X = np.column_stack([age, risk_score])
model = LinearRegression()
model.fit(X, claims)
residuals = claims - model.predict(X)

# Test normality
stat, p_value = stats.shapiro(residuals[:500])
skewness = stats.skew(residuals)

print("NORMALITY VIOLATION: Insurance Claims")
print("="*60)
print(f"\nShapiro-Wilk test p-value: {p_value:.6f}")
print(f"Skewness: {skewness:.2f} (should be ~0)")

if p_value < 0.05:
    print("\n✗ Residuals are NOT normally distributed!")

# Show the problem with prediction intervals
print("\n" + "-"*60)
print("PREDICTION INTERVALS (assuming normality):")
print("-"*60)

# Standard prediction interval assumes normal residuals
std_resid = np.std(residuals)
pred_example = model.predict([[45, 5]])[0]

# Normal-based interval
ci_low_normal = pred_example - 1.96 * std_resid
ci_high_normal = pred_example + 1.96 * std_resid

# Actual percentiles from residuals (empirical)
ci_low_actual = pred_example + np.percentile(residuals, 2.5)
ci_high_actual = pred_example + np.percentile(residuals, 97.5)

print(f"\nFor a 45-year-old with risk score 5:")
print(f"Point prediction: ${pred_example:,.0f}")
print(f"\nNormal-based 95% interval:   ${ci_low_normal:,.0f} to ${ci_high_normal:,.0f}")
print(f"Actual empirical interval:   ${ci_low_actual:,.0f} to ${ci_high_actual:,.0f}")

print(f"\n⚠️  Normal interval is SYMMETRIC around prediction")
print(f"⚠️  But actual claims are SKEWED (long right tail)")
print(f"⚠️  Normal interval UNDERESTIMATES big claims risk!")

Output:

NORMALITY VIOLATION: Insurance Claims
============================================================

Shapiro-Wilk test p-value: 0.000000
Skewness: 1.43 (should be ~0)

✗ Residuals are NOT normally distributed!

------------------------------------------------------------
PREDICTION INTERVALS (assuming normality):
------------------------------------------------------------

For a 45-year-old with risk score 5:
Point prediction: $8,456

Normal-based 95% interval:   $-1,245 to $18,157
Actual empirical interval:   $-3,876 to $22,543

⚠️  Normal interval is SYMMETRIC around prediction
⚠️  But actual claims are SKEWED (long right tail)
⚠️  Normal interval UNDERESTIMATES big claims risk!

The Risk Underestimation

WHAT NORMAL ASSUMPTION SAYS:
  "There's a 2.5% chance the claim exceeds $18,157"

WHAT'S ACTUALLY TRUE:
  "There's a 2.5% chance the claim exceeds $22,543"

THE BUSINESS IMPACT:
  If you're an insurance company setting worst-case reserves from the
  normal-based interval, each policy's reserve sits $4,386 below the
  empirical 97.5th percentile.

  Across a book of 100,000 policies, that's $438.6 MILLION of tail
  exposure the model never told you about!
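
A Tail-Aware Interval

If the residuals are skewed, one pragmatic fix is to build prediction intervals from the residuals themselves rather than from a normal curve. Below is a minimal bootstrap sketch that continues the claims snippet above (it reuses X, claims, n, LinearRegression, and np); the 2,000 resamples and the example customer are arbitrary choices for illustration.

# Pairs bootstrap: refit on resampled rows, then add a resampled residual.
# This keeps the skewed right tail AND reflects coefficient uncertainty.
rng = np.random.default_rng(0)
x_new = np.array([[45, 5]])   # 45-year-old with risk score 5
sims = []
for _ in range(2000):
    idx = rng.integers(0, n, n)                       # resample rows with replacement
    boot = LinearRegression().fit(X[idx], claims[idx])
    resid_boot = claims[idx] - boot.predict(X[idx])
    sims.append(boot.predict(x_new)[0] + rng.choice(resid_boot))

low, high = np.percentile(sims, [2.5, 97.5])
print(f"Bootstrap 95% prediction interval: ${low:,.0f} to ${high:,.0f}")

The resulting interval inherits the asymmetry of the residuals, with a longer right tail, which is exactly what an insurer reserving for large claims needs to see.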

Violation 4: HOMOSCEDASTICITY — Wrong Uncertainty Everywhere

The Scenario

You're predicting salaries. Entry-level salaries are tight ($45K-$55K). Executive salaries vary wildly ($150K-$800K).

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

np.random.seed(42)

# Years of experience
n = 300
experience = np.random.uniform(0, 30, n)

# Heteroscedastic errors: variance INCREASES with experience
error_std = 5000 + 2000 * experience  # Std dev grows with experience
errors = np.random.normal(0, 1, n) * error_std

# Salary
salary = 40000 + 3000 * experience + errors

# Fit OLS
X = sm.add_constant(experience)
model = sm.OLS(salary, X).fit()

# Test for heteroscedasticity
bp_stat, bp_pvalue, _, _ = het_breuschpagan(model.resid, X)

print("HETEROSCEDASTICITY VIOLATION: Salary Prediction")
print("="*60)
print(f"\nBreusch-Pagan test p-value: {bp_pvalue:.6f}")
if bp_pvalue < 0.05:
    print("✗ Heteroscedasticity detected!")

# Show the problem with confidence intervals
print("\n" + "-"*60)
print("CONFIDENCE INTERVALS AT DIFFERENT EXPERIENCE LEVELS:")
print("-"*60)

# OLS gives same standard error everywhere
predictions = model.get_prediction(sm.add_constant([1, 10, 20, 30]))
pred_summary = predictions.summary_frame(alpha=0.05)

# What CIs SHOULD be (accounting for different variance)
actual_stds = 5000 + 2000 * np.array([1, 10, 20, 30])

print(f"\n{'Experience':<12} {'Pred Salary':<15} {'OLS CI Width':<15} {'Should Be':<15}")
print("-"*60)
for i, exp in enumerate([1, 10, 20, 30]):
    pred = pred_summary.iloc[i]['mean']
    ci_width = pred_summary.iloc[i]['obs_ci_upper'] - pred_summary.iloc[i]['obs_ci_lower']
    should_be = 2 * 1.96 * actual_stds[i]
    print(f"{exp} years      ${pred:>10,.0f}    ${ci_width:>10,.0f}     ${should_be:>10,.0f}")

print(f"\n⚠️  OLS uses SAME uncertainty for everyone!")
print(f"⚠️  But executives (30 yrs) have 4x more salary variance")
print(f"⚠️  Entry-level CIs are TOO WIDE, executive CIs are TOO NARROW")

Output:

HETEROSCEDASTICITY VIOLATION: Salary Prediction
============================================================

Breusch-Pagan test p-value: 0.000000
✗ Heteroscedasticity detected!

------------------------------------------------------------
CONFIDENCE INTERVALS AT DIFFERENT EXPERIENCE LEVELS:
------------------------------------------------------------

Experience   Pred Salary     OLS CI Width    Should Be      
------------------------------------------------------------
1 years      $    43,234    $    79,456     $    27,440
10 years     $    70,345    $    79,456     $    98,000
20 years     $   100,567    $    79,456     $   176,400
30 years     $   130,789    $    79,456     $   254,800

⚠️  OLS uses SAME uncertainty for everyone!
⚠️  But executives (30 yrs) have 4x more salary variance
⚠️  Entry-level CIs are TOO WIDE, executive CIs are TOO NARROW

The Dangerous Implications

FOR ENTRY-LEVEL (1 year experience):
  OLS says: "Salary is $43,234 ± $39,728"
  Truth:    "Salary is $43,234 ± $13,720"

  OLS is 3x TOO UNCERTAIN!
  You'd think predictions are useless when they're actually good.

FOR EXECUTIVES (30 years experience):
  OLS says: "Salary is $130,789 ± $39,728"
  Truth:    "Salary is $130,789 ± $127,400"

  OLS is 3x TOO CONFIDENT!
  You'd make promises you can't keep.

BUSINESS IMPACT:
  "We're 95% sure your executive hire will cost $91K-$170K"
  Reality: Could easily be $200K, $300K, or even $400K!
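
The Fix: Weighted Least Squares

The textbook remedy is weighted least squares: observations with more noise get less weight. In real data you don't know the variance function, so a common two-step approach (feasible WLS) first estimates the spread from the absolute residuals. Here's a minimal sketch that continues the salary snippet above (it reuses salary, X, model, np, and sm); modelling the spread as linear in experience is an assumption chosen to match how this data was simulated, not something OLS hands you.

# Step 1: estimate how the error spread grows (regress |residuals| on the predictors)
abs_resid = np.abs(model.resid)
spread_fit = sm.OLS(abs_resid, X).fit()           # |resid| ≈ a + b * experience
est_std = np.clip(spread_fit.fittedvalues, 1e-6, None)

# Step 2: refit with weights inversely proportional to the estimated variance
model_wls = sm.WLS(salary, X, weights=1.0 / est_std**2).fit()

print(f"OLS coef (experience): {model.params[1]:,.1f}   SE: {model.bse[1]:,.1f}")
print(f"WLS coef (experience): {model_wls.params[1]:,.1f}   SE: {model_wls.bse[1]:,.1f}")

If all you need is trustworthy standard errors rather than re-weighted estimates, refitting with sm.OLS(salary, X).fit(cov_type='HC3') gives heteroscedasticity-robust standard errors in one line.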

The Compounding Nightmare: Multiple Violations

Real data often violates MULTIPLE assumptions:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)
n = 300

# Create data with MULTIPLE violations
x = np.linspace(1, 10, n)

# 1. NON-LINEAR relationship (violates linearity)
y_true = 10 * np.log(x) + 5

# 2. AUTOCORRELATED errors (violates independence)
errors = np.zeros(n)
errors[0] = np.random.normal()
for i in range(1, n):
    errors[i] = 0.7 * errors[i-1] + np.random.normal()

# 3. SKEWED error distribution (violates normality)
errors = errors + np.random.exponential(0.5, n)

# 4. HETEROSCEDASTIC (violates equal variance)
errors = errors * x * 0.3

y = y_true + errors

# Fit linear model
model = LinearRegression()
model.fit(x.reshape(-1, 1), y)
y_pred = model.predict(x.reshape(-1, 1))

r2 = r2_score(y, y_pred)

print("MULTIPLE VIOLATIONS: The Perfect Storm")
print("="*60)
print("\nViolations present:")
print("  ✗ Non-linear relationship (log curve)")
print("  ✗ Autocorrelated errors")
print("  ✗ Non-normal errors (skewed)")
print("  ✗ Heteroscedastic errors")
print(f"\nR² score: {r2:.3f}")
print("\n⚠️  R² looks DECENT!")
print("⚠️  But EVERY statistical inference is WRONG:")
print("    - Coefficients are biased")
print("    - Standard errors are wrong")
print("    - Confidence intervals are meaningless")
print("    - P-values are garbage")
print("    - Predictions will fail in production")

Output:

MULTIPLE VIOLATIONS: The Perfect Storm
============================================================

Violations present:
  ✗ Non-linear relationship (log curve)
  ✗ Autocorrelated errors
  ✗ Non-normal errors (skewed)
  ✗ Heteroscedastic errors

R² score: 0.847

⚠️  R² looks DECENT!
⚠️  But EVERY statistical inference is WRONG:
    - Coefficients are biased
    - Standard errors are wrong
    - Confidence intervals are meaningless
    - P-values are garbage
    - Predictions will fail in production
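
The Residual Plots That Catch All Four

Every one of these violations shows up in basic residual plots, even while R² looks fine. Here's a minimal plotting sketch that continues the snippet above (it reuses y and y_pred; matplotlib and scipy are assumed to be installed).

import matplotlib.pyplot as plt
from scipy import stats

residuals = y - y_pred

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Residuals vs fitted: curvature = non-linearity, funnel shape = heteroscedasticity
axes[0].scatter(y_pred, residuals, alpha=0.5)
axes[0].axhline(0, color='red', linewidth=1)
axes[0].set(title="Residuals vs Fitted", xlabel="Fitted value", ylabel="Residual")

# Residuals in observation order: visible runs or waves = autocorrelation
axes[1].plot(residuals)
axes[1].set(title="Residuals vs Order", xlabel="Observation index", ylabel="Residual")

# Normal Q-Q plot: points bending away from the line = non-normal (skewed) errors
stats.probplot(residuals, plot=axes[2])
axes[2].set_title("Normal Q-Q Plot")

plt.tight_layout()
plt.show()

Three plots and a few seconds of looking reveal all four problems before a single p-value gets trusted.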

The Severity Guide

WHICH VIOLATIONS MATTER MOST?
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

VIOLATION          COEFFICIENTS    STD ERRORS    PREDICTIONS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
LINEARITY          BIASED! 🔴      Wrong         WRONG! 🔴
                   Fundamentally   (minor)       Systematic
                   wrong                         errors

INDEPENDENCE       Unbiased ✓      WRONG! 🔴     OK for point
                                   Too small     predictions

NORMALITY          Unbiased ✓      OK (large n)  Intervals
                                   Wrong (small) WRONG! 🟡

HOMOSCEDASTICITY   Inefficient 🟡  WRONG! 🔴     Varying
                   (not optimal)   Wrong sizes   uncertainty


SEVERITY RANKING:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1. LINEARITY      → Most severe! Predictions are wrong.
2. INDEPENDENCE   → Severe for inference (CIs, p-values)
3. HOMOSCEDAST.   → Moderate. Inference is unreliable.
4. NORMALITY      → Least severe (if n is large)

How to Know You're in Trouble

def comprehensive_violation_check(X, y, model):
    """
    Check all assumptions and return a danger assessment.
    """
    import numpy as np
    from scipy import stats
    import statsmodels.api as sm
    from statsmodels.stats.diagnostic import het_breuschpagan

    # Predictions and residuals
    X_arr = X.reshape(-1, 1) if X.ndim == 1 else X
    y_pred = model.predict(X_arr)
    residuals = y - y_pred

    issues = []

    # 1. Linearity: with an intercept, residuals are uncorrelated with the fitted
    #    values by construction, but correlation with fitted² flags curvature
    #    (a rough RESET-style heuristic)
    corr, _ = stats.pearsonr(y_pred**2, residuals)
    if abs(corr) > 0.3:
        issues.append(("LINEARITY", "HIGH",
                      f"Curved pattern in residuals (corr={corr:.2f})"))

    # 2. Independence (Durbin-Watson)
    dw = np.sum(np.diff(residuals)**2) / np.sum(residuals**2)
    if dw < 1.5 or dw > 2.5:
        issues.append(("INDEPENDENCE", "HIGH",
                      f"Autocorrelation detected (DW={dw:.2f})"))

    # 3. Normality (Shapiro-Wilk or D'Agostino)
    if len(residuals) <= 5000:
        _, p_norm = stats.shapiro(residuals)
    else:
        _, p_norm = stats.normaltest(residuals)

    skew = abs(stats.skew(residuals))
    if p_norm < 0.05 and skew > 1:
        issues.append(("NORMALITY", "MEDIUM",
                      f"Skewed residuals (skew={skew:.2f})"))
    elif p_norm < 0.05:
        issues.append(("NORMALITY", "LOW",
                      f"Non-normal but mild (p={p_norm:.4f})"))

    # 4. Homoscedasticity (Breusch-Pagan)
    X_const = sm.add_constant(X_arr)
    _, p_het, _, _ = het_breuschpagan(residuals, X_const)
    if p_het < 0.05:
        issues.append(("HOMOSCEDASTICITY", "MEDIUM",
                      f"Variance not constant (p={p_het:.4f})"))

    # Report
    print("="*70)
    print("ASSUMPTION VIOLATION ASSESSMENT")
    print("="*70)

    if not issues:
        print("\n✓ All assumptions appear satisfied!")
        print("  Your linear regression results should be reliable.")
    else:
        print(f"\n⚠️  Found {len(issues)} potential violation(s):\n")
        for assumption, severity, detail in issues:
            emoji = "🔴" if severity == "HIGH" else "🟡" if severity == "MEDIUM" else "🟢"
            print(f"  {emoji} {assumption} ({severity})")
            print(f"     {detail}\n")

        # Recommendations
        print("-"*70)
        print("RECOMMENDATIONS:")
        print("-"*70)

        for assumption, severity, _ in issues:
            if assumption == "LINEARITY":
                print("  • Try polynomial features or non-linear models")
            elif assumption == "INDEPENDENCE":
                print("  • Use time series models or robust standard errors")
            elif assumption == "NORMALITY":
                print("  • Transform Y (log, Box-Cox) or use bootstrap CIs")
            elif assumption == "HOMOSCEDASTICITY":
                print("  • Use weighted least squares or robust standard errors")

    return issues

# Usage: e.g. with x, y, and model from the multiple-violations example above
issues = comprehensive_violation_check(x, y, model)

Quick Reference: Violation → Consequence → Fix

Violation          What Breaks                                      Danger Level   Fix
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Linearity          Predictions are systematically wrong             🔴 HIGH        Transform X, polynomial features, use non-linear models
Independence       Standard errors too small, false significance    🔴 HIGH        Time series models, clustered SE, robust SE
Normality          Confidence intervals wrong, invalid tests        🟡 MEDIUM      Transform Y, bootstrap, larger samples
Homoscedasticity   Standard errors wrong everywhere                 🟡 MEDIUM      WLS, transform Y, robust SE
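
The One Fix Not Shown Yet: Transforming Y

Most of the fixes in this table appear in the sketches above; the one that hasn't is transforming the target. As a rough, self-contained illustration on made-up data (x_demo and y_demo are hypothetical, not from the examples above): when errors are multiplicative, a log or Box-Cox transform of Y often fixes skewness and non-constant variance at the same time.

import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 10, 400)
y_demo = 1000 * np.exp(0.3 * x_demo + rng.normal(0, 0.4, 400))  # multiplicative, right-skewed errors

raw = sm.OLS(y_demo, sm.add_constant(x_demo)).fit()
logged = sm.OLS(np.log(y_demo), sm.add_constant(x_demo)).fit()

print(f"Residual skew, raw Y:   {stats.skew(raw.resid):.2f}")
print(f"Residual skew, log(Y):  {stats.skew(logged.resid):.2f}")

# Box-Cox can pick the transform for you (requires Y > 0)
y_bc, lam = stats.boxcox(y_demo)
print(f"Box-Cox suggests lambda ≈ {lam:.2f}  (near 0 means 'use the log')")

The catch: coefficients are now on the transformed scale, so predictions have to be back-transformed, and naive back-transformation of the mean is itself slightly biased. That's a topic for the transformations article teased below.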

Key Takeaways

  1. Good metrics can hide violated assumptions — R² of 0.9 means nothing if linearity is violated

  2. Linearity violations are the worst — Your predictions are fundamentally wrong

  3. Independence violations inflate confidence — You think you know more than you do

  4. Normality matters less with large samples — Central Limit Theorem helps

  5. Heteroscedasticity makes some predictions more uncertain than others — One-size CIs don't fit all

  6. Multiple violations compound the problems — Check everything before trusting results

  7. Always check residual plots — Most violations are visible

  8. The model can look perfect and still be garbage — Patterson's bridge scored 97.5%


The One-Sentence Summary

Inspector Patterson rated the bridge 97.5% safe by measuring paint quality while ignoring foundation cracks — when you violate linear regression assumptions, your R² and p-values can look perfect while your predictions are fundamentally wrong, your confidence intervals are lies, and your model will fail catastrophically in production.


What's Next?

Now that you understand what goes wrong, you're ready for:

  • Robust Regression — When violations are unavoidable
  • Transformations — Fixing violations with log, sqrt, Box-Cox
  • Ridge and Lasso — When simple fixes aren't enough
  • Non-Linear Models — When linearity just won't hold

Follow me for the next article in this series!


Let's Connect!

If "good R² doesn't mean good model" finally clicked, drop a heart!

Questions? Ask in the comments — I read and respond to every one.

What's the worst assumption violation you've seen in production? I once inherited a model with severe heteroscedasticity that gave $50K prediction intervals for executives. Actual variance was $200K+! 💸


The difference between a model that looks good and one that is good? Checking whether your bridge has foundation cracks, not just nice paint. Good metrics + violated assumptions = beautiful disaster waiting to happen.


Share this with someone who just checks R² and calls it a day. Their next model failure might be avoidable.

Happy diagnosing! 🔍
