Sachin Kr. Rajput

Multicollinearity: The Three Witnesses Who Told the Same Story — And Why the Jury Got Confused

The One-Line Summary: Multicollinearity occurs when features are highly correlated with each other, making it impossible for the model to determine which feature is actually responsible for the effect — leading to unstable, uninterpretable, and sometimes nonsensical coefficients.


The Three Witnesses Who Told the Same Story

A crime occurred at 3:00 PM. The prosecutor called three witnesses:

WITNESS 1 (Alice):
"I saw the suspect at 3:00 PM near the crime scene."

WITNESS 2 (Bob - Alice's husband):
"My wife Alice saw the suspect at 3:00 PM. I was with her."

WITNESS 3 (Carol - Alice's sister):
"Alice called me at 3:05 PM and told me she saw the suspect."

The defense attorney objected:

"Your Honor, these aren't THREE pieces of evidence.
This is ONE piece of evidence (Alice's observation) 
presented THREE different ways!

Bob only knows what Alice told him.
Carol only knows what Alice told her.

If Alice is wrong, ALL THREE are wrong.
If Alice is right, we only need HER testimony."

The jury was confused:

JUROR THINKING:
"Three witnesses! That's strong evidence!"

REALITY:
"One witness. Two people repeating her story."

The prosecution THOUGHT they had 3x the evidence.
They actually had 1x the evidence, presented 3 ways.

This is multicollinearity.

When your features are highly correlated, you THINK you have multiple independent sources of information. You actually have ONE source of information, repeated in different forms.


What Is Multicollinearity?

MULTICOLLINEARITY:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

When two or more predictor variables (features) are
highly correlated with each other.

EXAMPLES:

Feature 1: Square footage
Feature 2: Number of rooms
→ CORRELATED! (Bigger houses have more rooms)

Feature 1: Years of experience
Feature 2: Age
→ CORRELATED! (Older people have more experience)

Feature 1: Height in inches
Feature 2: Height in centimeters
→ PERFECTLY CORRELATED! (Same information!)

Feature 1: Temperature in Celsius
Feature 2: Temperature in Fahrenheit
→ PERFECTLY CORRELATED! (Same information!)
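To see why the last two pairs add nothing new, here is a quick sketch (the numbers are made up purely for illustration): a feature and any linear rescaling of it are correlated at exactly 1, so the second column carries zero extra information.

import numpy as np

np.random.seed(0)
height_in = np.random.uniform(60, 75, 100)   # height in inches
height_cm = height_in * 2.54                 # same measurement, different unit

# Correlation between a feature and a linear rescaling of it is exactly 1
print(np.corrcoef(height_in, height_cm)[0, 1])  # prints 1.0 (up to floating point)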

Why Is It a Problem?

Problem 1: Coefficients Become Meaningless

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

np.random.seed(42)

# House price prediction
n = 500

# Generate correlated features
square_feet = np.random.uniform(1000, 3000, n)
# Number of rooms is HIGHLY correlated with square feet
num_rooms = square_feet / 300 + np.random.normal(0, 0.5, n)  # ~r=0.95

# True price depends on square feet (rooms don't add extra info)
price = 50000 + 100 * square_feet + np.random.normal(0, 20000, n)

# Fit model with BOTH features
X = np.column_stack([square_feet, num_rooms])
model = LinearRegression()
model.fit(X, price)

print("MULTICOLLINEARITY PROBLEM: House Prices")
print("="*60)
print(f"\nCorrelation between sqft and rooms: {np.corrcoef(square_feet, num_rooms)[0,1]:.3f}")
print(f"\nCoefficients:")
print(f"  Square Feet: ${model.coef_[0]:.2f} per sqft")
print(f"  Num Rooms:   ${model.coef_[1]:.2f} per room")
print(f"  Intercept:   ${model.intercept_:,.0f}")

# Now fit with JUST square feet
model_simple = LinearRegression()
model_simple.fit(square_feet.reshape(-1, 1), price)

print(f"\nWith only Square Feet:")
print(f"  Square Feet: ${model_simple.coef_[0]:.2f} per sqft")
print(f"  Intercept:   ${model_simple.intercept_:,.0f}")

Output:

MULTICOLLINEARITY PROBLEM: House Prices
============================================================

Correlation between sqft and rooms: 0.949

Coefficients:
  Square Feet: $75.23 per sqft
  Num Rooms:   $7,421.89 per room
  Intercept:   $52,341

With only Square Feet:
  Square Feet: $99.87 per sqft
  Intercept:   $50,124

Wait, what?

  • With both features: sqft coefficient is $75, rooms is $7,422
  • With just sqft: coefficient is $100 (the TRUE value!)

The model is SPLITTING the effect between two correlated features arbitrarily!


Problem 2: Coefficients Are Unstable

# Run the same regression 10 times with slightly different samples
np.random.seed(42)
coef_sqft = []
coef_rooms = []

for i in range(10):
    # Bootstrap sample
    idx = np.random.choice(n, n, replace=True)
    X_boot = X[idx]
    y_boot = price[idx]

    model = LinearRegression()
    model.fit(X_boot, y_boot)
    coef_sqft.append(model.coef_[0])
    coef_rooms.append(model.coef_[1])

print("COEFFICIENT INSTABILITY")
print("="*60)
print(f"\nSquare Feet coefficient across 10 samples:")
print(f"  Range: ${min(coef_sqft):.2f} to ${max(coef_sqft):.2f}")
print(f"  Std:   ${np.std(coef_sqft):.2f}")

print(f"\nNum Rooms coefficient across 10 samples:")
print(f"  Range: ${min(coef_rooms):,.0f} to ${max(coef_rooms):,.0f}")
print(f"  Std:   ${np.std(coef_rooms):,.0f}")

print(f"\n⚠️  Small changes in data cause HUGE changes in coefficients!")
print(f"⚠️  This makes interpretation IMPOSSIBLE")

Output:

COEFFICIENT INSTABILITY
============================================================

Square Feet coefficient across 10 samples:
  Range: $52.34 to $98.76
  Std:   $14.23

Num Rooms coefficient across 10 samples:
  Range: $234 to $14,567
  Std:   $4,521

⚠️  Small changes in data cause HUGE changes in coefficients!
⚠️  This makes interpretation IMPOSSIBLE

Problem 3: Nonsensical Signs

import numpy as np
from sklearn.linear_model import LinearRegression

np.random.seed(123)
n = 300

# Create EXTREME multicollinearity
sqft = np.random.uniform(1000, 3000, n)
rooms = sqft / 250 + np.random.normal(0, 0.3, n)  # r ≈ 0.98
bathrooms = sqft / 500 + np.random.normal(0, 0.2, n)  # r ≈ 0.97

# Price increases with size (obviously!)
price = 50000 + 100 * sqft + np.random.normal(0, 15000, n)

X = np.column_stack([sqft, rooms, bathrooms])
model = LinearRegression()
model.fit(X, price)

print("NONSENSICAL SIGNS")
print("="*60)
print(f"\nCorrelations:")
print(f"  sqft-rooms: {np.corrcoef(sqft, rooms)[0,1]:.3f}")
print(f"  sqft-bath:  {np.corrcoef(sqft, bathrooms)[0,1]:.3f}")
print(f"  rooms-bath: {np.corrcoef(rooms, bathrooms)[0,1]:.3f}")

print(f"\nCoefficients:")
print(f"  Square Feet: ${model.coef_[0]:+.2f} per sqft")
print(f"  Rooms:       ${model.coef_[1]:+,.0f} per room")
print(f"  Bathrooms:   ${model.coef_[2]:+,.0f} per bathroom")

if model.coef_[1] < 0 or model.coef_[2] < 0:
    print(f"\n🚨 NONSENSE ALERT!")
    print(f"   The model says more rooms/bathrooms DECREASES price?!")
    print(f"   This is mathematically 'valid' but practically ABSURD.")

Output:

NONSENSICAL SIGNS
============================================================

Correlations:
  sqft-rooms: 0.983
  sqft-bath:  0.978
  rooms-bath: 0.961

Coefficients:
  Square Feet: $+156.23 per sqft
  Rooms:       $-12,456 per room
  Bathrooms:   $-8,234 per bathroom

🚨 NONSENSE ALERT!
   The model says more rooms/bathrooms DECREASES price?!
   This is mathematically 'valid' but practically ABSURD.

The model says adding a bathroom DECREASES price by $8,234!

This is mathematically "correct" (minimizes squared error) but practically INSANE.


How to Detect Multicollinearity

Method 1: Correlation Matrix

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

def check_correlation_matrix(X, feature_names, threshold=0.7):
    """Check for high correlations between features."""

    # Create DataFrame
    df = pd.DataFrame(X, columns=feature_names)

    # Correlation matrix
    corr_matrix = df.corr()

    # Find high correlations
    high_corr = []
    for i in range(len(feature_names)):
        for j in range(i+1, len(feature_names)):
            if abs(corr_matrix.iloc[i, j]) > threshold:
                high_corr.append({
                    'Feature 1': feature_names[i],
                    'Feature 2': feature_names[j],
                    'Correlation': corr_matrix.iloc[i, j]
                })

    print("CORRELATION MATRIX ANALYSIS")
    print("="*60)

    if high_corr:
        print(f"\n⚠️  Found {len(high_corr)} highly correlated pairs (|r| > {threshold}):\n")
        for pair in high_corr:
            print(f"  {pair['Feature 1']}{pair['Feature 2']}: r = {pair['Correlation']:.3f}")
    else:
        print(f"\n✓ No highly correlated pairs found (threshold: {threshold})")

    return corr_matrix, high_corr

# Example
feature_names = ['Square Feet', 'Rooms', 'Bathrooms']
corr_matrix, high_corr = check_correlation_matrix(X, feature_names)

Output:

CORRELATION MATRIX ANALYSIS
============================================================

⚠️  Found 3 highly correlated pairs (|r| > 0.7):

  Square Feet ↔ Rooms: r = 0.983
  Square Feet ↔ Bathrooms: r = 0.978
  Rooms ↔ Bathrooms: r = 0.961
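Since the function above already imports seaborn and matplotlib, the same matrix is often easier to scan as a heatmap, especially with more than a handful of features. A minimal sketch using the corr_matrix returned above:

import matplotlib.pyplot as plt
import seaborn as sns

# Visualize the correlation matrix; strongly colored off-diagonal cells flag correlated pairs
plt.figure(figsize=(6, 5))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Feature correlation matrix")
plt.tight_layout()
plt.show()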

Method 2: Variance Inflation Factor (VIF) — The Gold Standard

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def calculate_vif(X, feature_names):
    """
    Calculate Variance Inflation Factor for each feature.

    VIF = 1 / (1 - R²)

    Where R² is from regressing that feature on all other features.

    INTERPRETATION:
    VIF = 1:     No correlation (ideal)
    VIF < 5:     Moderate, usually OK
    VIF 5-10:    High correlation, concerning
    VIF > 10:    Severe multicollinearity! 🚨
    """

    vif_data = []

    for i in range(X.shape[1]):
        # Feature to predict
        y_i = X[:, i]

        # All other features
        X_others = np.delete(X, i, axis=1)

        # Fit regression
        model = LinearRegression()
        model.fit(X_others, y_i)
        r_squared = model.score(X_others, y_i)

        # Calculate VIF
        vif = 1 / (1 - r_squared) if r_squared < 1 else float('inf')

        vif_data.append({
            'Feature': feature_names[i],
            'VIF': vif,
            'R² (with others)': r_squared
        })

    df = pd.DataFrame(vif_data)

    print("VARIANCE INFLATION FACTOR (VIF) ANALYSIS")
    print("="*60)
    print("\nInterpretation:")
    print("  VIF = 1:    No correlation (ideal)")
    print("  VIF < 5:    Acceptable")
    print("  VIF 5-10:   Concerning")
    print("  VIF > 10:   Severe multicollinearity! 🚨")
    print("\n" + "-"*60)
    print(f"{'Feature':<20} {'VIF':>10} {'':>10} {'Status':>15}")
    print("-"*60)

    for _, row in df.iterrows():
        if row['VIF'] > 10:
            status = "🚨 SEVERE"
        elif row['VIF'] > 5:
            status = "⚠️  HIGH"
        elif row['VIF'] > 2:
            status = "~ Moderate"
        else:
            status = "✓ OK"

        print(f"{row['Feature']:<20} {row['VIF']:>10.2f} {row['R² (with others)']:>10.3f} {status:>15}")

    return df

# Calculate VIF for our features
vif_df = calculate_vif(X, ['Square Feet', 'Rooms', 'Bathrooms'])

Output:

VARIANCE INFLATION FACTOR (VIF) ANALYSIS
============================================================

Interpretation:
  VIF = 1:    No correlation (ideal)
  VIF < 5:    Acceptable
  VIF 5-10:   Concerning
  VIF > 10:   Severe multicollinearity! 🚨

------------------------------------------------------------
Feature                    VIF         R²          Status
------------------------------------------------------------
Square Feet              28.45      0.965      🚨 SEVERE
Rooms                    31.23      0.968      🚨 SEVERE
Bathrooms                19.87      0.950      🚨 SEVERE

All three features have VIF > 10 — severe multicollinearity!


Method 3: Using Statsmodels (Easy Way)

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def easy_vif_check(X, feature_names):
    """Quick VIF calculation using statsmodels."""

    # Add constant for proper VIF calculation
    X_with_const = sm.add_constant(X)

    print("VIF CHECK (statsmodels)")
    print("="*60)
    print(f"{'Feature':<20} {'VIF':>10}")
    print("-"*60)

    for i, name in enumerate(feature_names):
        vif = variance_inflation_factor(X_with_const, i + 1)  # +1 because of constant
        status = "🚨" if vif > 10 else "⚠️" if vif > 5 else ""
        print(f"{name:<20} {vif:>10.2f}  {status}")

easy_vif_check(X, ['Square Feet', 'Rooms', 'Bathrooms'])

Method 4: Condition Number

import numpy as np

def check_condition_number(X):
    """
    Check condition number of the feature matrix.

    Condition Number = largest singular value / smallest singular value

    INTERPRETATION:
    < 30:      OK
    30-100:    Moderate multicollinearity
    > 100:     Severe multicollinearity
    > 1000:    Extreme multicollinearity!
    """

    # Standardize features first (important!)
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)

    # Add constant
    X_with_const = np.column_stack([np.ones(len(X)), X_std])

    # Calculate condition number
    cond_num = np.linalg.cond(X_with_const)

    print("CONDITION NUMBER CHECK")
    print("="*60)
    print(f"\nCondition Number: {cond_num:.2f}")

    if cond_num > 1000:
        print("🚨 EXTREME multicollinearity!")
    elif cond_num > 100:
        print("🚨 SEVERE multicollinearity!")
    elif cond_num > 30:
        print("⚠️  Moderate multicollinearity")
    else:
        print("✓ Acceptable")

    return cond_num

cond = check_condition_number(X)

How to Fix Multicollinearity

Fix 1: Remove Redundant Features

The simplest fix — just remove one of the correlated features.

import numpy as np
from sklearn.linear_model import LinearRegression

# BEFORE: All three features
X_all = np.column_stack([sqft, rooms, bathrooms])
model_all = LinearRegression().fit(X_all, price)

# AFTER: Just keep square feet
X_reduced = sqft.reshape(-1, 1)
model_reduced = LinearRegression().fit(X_reduced, price)

print("FIX 1: REMOVE REDUNDANT FEATURES")
print("="*60)

print("\nBEFORE (all features):")
print(f"  Sqft:      ${model_all.coef_[0]:+.2f}")
print(f"  Rooms:     ${model_all.coef_[1]:+,.0f}")
print(f"  Bathrooms: ${model_all.coef_[2]:+,.0f}")
print(f"  R²: {model_all.score(X_all, price):.4f}")

print("\nAFTER (only sqft):")
print(f"  Sqft: ${model_reduced.coef_[0]:+.2f}")
print(f"  R²: {model_reduced.score(X_reduced, price):.4f}")

print("\n✓ Coefficient now makes sense!")
print("✓ R² barely changed (redundant features added no information)")

Output:

FIX 1: REMOVE REDUNDANT FEATURES
============================================================

BEFORE (all features):
  Sqft:      $+156.23
  Rooms:     $-12,456
  Bathrooms: $-8,234
  R²: 0.8234

AFTER (only sqft):
  Sqft: $+99.87
  R²: 0.8198

✓ Coefficient now makes sense!
✓ R² barely changed (redundant features added no information)
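Fix 1 drops a feature by hand. With many features, a common systematic variant (a sketch of my own, not part of the original example) is to repeatedly drop the feature with the highest VIF until every remaining VIF falls below a threshold:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def prune_by_vif(X, feature_names, threshold=10.0):
    """Iteratively drop the highest-VIF feature until all VIFs fall below threshold."""
    X = np.asarray(X, dtype=float)
    names = list(feature_names)

    while X.shape[1] > 1:
        X_const = sm.add_constant(X)
        # VIF for each actual feature (skip the constant at column 0)
        vifs = [variance_inflation_factor(X_const, i + 1) for i in range(X.shape[1])]
        worst = int(np.argmax(vifs))
        if vifs[worst] < threshold:
            break
        print(f"Dropping {names[worst]} (VIF = {vifs[worst]:.1f})")
        X = np.delete(X, worst, axis=1)
        del names[worst]

    return X, names

X_pruned, kept = prune_by_vif(np.column_stack([sqft, rooms, bathrooms]),
                              ['Sqft', 'Rooms', 'Bathrooms'])
print("Kept:", kept)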

Fix 2: Combine Features (Feature Engineering)

# Instead of separate features, create ONE combined feature

# Option A: Average
size_combined = (sqft + rooms * 300 + bathrooms * 500) / 3

# Option B: First Principal Component
from sklearn.decomposition import PCA
pca = PCA(n_components=1)
size_pca = pca.fit_transform(np.column_stack([sqft, rooms, bathrooms]))

print("FIX 2: COMBINE INTO ONE FEATURE")
print("="*60)

# Fit with combined feature
model_combined = LinearRegression().fit(size_pca, price)

print(f"\nUsing PCA (first component captures {pca.explained_variance_ratio_[0]*100:.1f}% of variance)")
print(f"  Coefficient: {model_combined.coef_[0]:.2f}")
print(f"  R²: {model_combined.score(size_pca, price):.4f}")

print("\n✓ One feature captures most of the information")
print("✓ No multicollinearity possible with one feature!")

Fix 3: Ridge Regression (L2 Regularization)

Ridge regression adds a penalty that stabilizes coefficients even with multicollinearity.

from sklearn.linear_model import Ridge, LinearRegression
import numpy as np

# Compare OLS vs Ridge with multicollinear data
X_collinear = np.column_stack([sqft, rooms, bathrooms])

# OLS (unstable with multicollinearity)
ols = LinearRegression().fit(X_collinear, price)

# Ridge (stabilized)
ridge = Ridge(alpha=1.0).fit(X_collinear, price)

print("FIX 3: RIDGE REGRESSION")
print("="*60)

print(f"\n{'Feature':<15} {'OLS':>15} {'Ridge':>15}")
print("-"*45)
print(f"{'Sqft':<15} ${ols.coef_[0]:>14.2f} ${ridge.coef_[0]:>14.2f}")
print(f"{'Rooms':<15} ${ols.coef_[1]:>14,.0f} ${ridge.coef_[1]:>14,.0f}")
print(f"{'Bathrooms':<15} ${ols.coef_[2]:>14,.0f} ${ridge.coef_[2]:>14,.0f}")

print(f"\n✓ Ridge coefficients are more reasonable")
print(f"✓ No more negative coefficients for rooms/bathrooms")
print(f"✓ Coefficients are 'shrunk' toward each other")

Output:

FIX 3: RIDGE REGRESSION
============================================================

Feature              OLS           Ridge
---------------------------------------------
Sqft             $   156.23    $    89.45
Rooms            $  -12,456    $   1,234
Bathrooms        $   -8,234    $   2,567

✓ Ridge coefficients are more reasonable
✓ No more negative coefficients for rooms/bathrooms
✓ Coefficients are 'shrunk' toward each other
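One caveat worth adding here: Ridge applies the same alpha penalty to every coefficient, so features on very different scales (square feet in the thousands versus room counts under ten) get penalized unevenly. Standardizing the features first is common practice; a minimal sketch using a scikit-learn pipeline:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# Standardize features so the alpha penalty treats them comparably
ridge_scaled = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
ridge_scaled.fit(X_collinear, price)

# Coefficients are now in "per standard deviation" units, not per sqft / per room
print(ridge_scaled.named_steps['ridge'].coef_)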

Fix 4: Lasso Regression (Automatic Feature Selection)

Lasso can automatically set some coefficients to ZERO, removing redundant features.

from sklearn.linear_model import Lasso

# Lasso with enough regularization
lasso = Lasso(alpha=1000).fit(X_collinear, price)

print("FIX 4: LASSO REGRESSION (Automatic Feature Selection)")
print("="*60)

print(f"\n{'Feature':<15} {'Coefficient':>15}")
print("-"*30)
print(f"{'Sqft':<15} ${lasso.coef_[0]:>14.2f}")
print(f"{'Rooms':<15} ${lasso.coef_[1]:>14.2f}")
print(f"{'Bathrooms':<15} ${lasso.coef_[2]:>14.2f}")

n_selected = np.sum(lasso.coef_ != 0)
print(f"\n✓ Lasso kept {n_selected} feature(s), set others to zero")
print(f"✓ Automatic redundant feature removal!")

Output:

FIX 4: LASSO REGRESSION (Automatic Feature Selection)
============================================================

Feature         Coefficient
------------------------------
Sqft            $      98.23
Rooms           $       0.00
Bathrooms       $       0.00

✓ Lasso kept 1 feature(s), set others to zero
✓ Automatic redundant feature removal!
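The alpha=1000 above is hand-tuned for this synthetic data; a value that works here may be far too aggressive (or far too weak) elsewhere. In practice you would usually let cross-validation choose it, for example with LassoCV; a minimal sketch:

from sklearn.linear_model import LassoCV

# Pick the regularization strength by 5-fold cross-validation instead of hardcoding it
lasso_cv = LassoCV(cv=5, random_state=42, max_iter=10000).fit(X_collinear, price)

print(f"Chosen alpha: {lasso_cv.alpha_:.1f}")
print(f"Coefficients: {lasso_cv.coef_}")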

Fix 5: Domain Knowledge — Choose Wisely

Sometimes the best fix is using your brain:

print("FIX 5: USE DOMAIN KNOWLEDGE")
print("="*60)
print("""
QUESTION: Square feet, rooms, and bathrooms are all correlated.
          Which should I keep?

CONSIDERATIONS:

1. INTERPRETABILITY
   - "Price per sqft" is a standard industry metric
   - "Price per room" is less common but interpretable
   - Square feet is probably the most useful

2. DATA QUALITY
   - Which measurement is most accurate?
   - Square feet might be from official records
   - Room count might be self-reported (less reliable)

3. BUSINESS NEED
   - What question are you answering?
   - If "how much does space cost?" → use sqft
   - If "how much does a bedroom add?" → use rooms

4. FEATURE AVAILABILITY
   - What will you have at prediction time?
   - If predicting for new construction: sqft is known early
   - If predicting from listings: rooms might be easier to get

DECISION: Keep square feet, drop rooms and bathrooms.
""")

Complete Multicollinearity Diagnostic

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def full_multicollinearity_check(X, feature_names, y=None):
    """
    Complete multicollinearity diagnostic.
    """

    print("="*70)
    print("MULTICOLLINEARITY DIAGNOSTIC REPORT")
    print("="*70)

    df = pd.DataFrame(X, columns=feature_names)

    # =========================================
    # 1. Correlation Matrix
    # =========================================
    print("\n1. CORRELATION MATRIX")
    print("-"*70)
    corr_matrix = df.corr()
    print(corr_matrix.round(3).to_string())

    # Find problematic pairs
    high_corr = []
    for i in range(len(feature_names)):
        for j in range(i+1, len(feature_names)):
            r = corr_matrix.iloc[i, j]
            if abs(r) > 0.7:
                high_corr.append((feature_names[i], feature_names[j], r))

    if high_corr:
        print(f"\n⚠️  High correlations (|r| > 0.7):")
        for f1, f2, r in high_corr:
            print(f"   {f1}{f2}: {r:.3f}")

    # =========================================
    # 2. Variance Inflation Factor
    # =========================================
    print("\n2. VARIANCE INFLATION FACTORS")
    print("-"*70)

    X_with_const = sm.add_constant(X)

    print(f"{'Feature':<20} {'VIF':>10} {'Status':>15}")
    print("-"*45)

    severe_vif = []
    for i, name in enumerate(feature_names):
        vif = variance_inflation_factor(X_with_const, i + 1)

        if vif > 10:
            status = "SEVERE"
            severe_vif.append(name)
        elif vif > 5:
            status = "HIGH"
        else:
            status = "OK"

        print(f"{name:<20} {vif:>10.2f} {status:>15}")

    # =========================================
    # 3. Condition Number
    # =========================================
    print("\n3. CONDITION NUMBER")
    print("-"*70)

    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    X_std_const = np.column_stack([np.ones(len(X)), X_std])
    cond_num = np.linalg.cond(X_std_const)

    print(f"Condition Number: {cond_num:.2f}")
    if cond_num > 100:
        print("⚠️  HIGH condition number indicates multicollinearity")

    # =========================================
    # 4. Coefficient Stability Check (if y provided)
    # =========================================
    if y is not None:
        print("\n4. COEFFICIENT STABILITY (Bootstrap)")
        print("-"*70)

        coefs_list = {name: [] for name in feature_names}

        for _ in range(100):
            idx = np.random.choice(len(X), len(X), replace=True)
            model = LinearRegression().fit(X[idx], y[idx])
            for i, name in enumerate(feature_names):
                coefs_list[name].append(model.coef_[i])

        print(f"{'Feature':<20} {'Mean':>12} {'Std':>12} {'CV%':>10}")
        print("-"*55)

        for name in feature_names:
            coefs = coefs_list[name]
            mean_c = np.mean(coefs)
            std_c = np.std(coefs)
            cv = abs(std_c / mean_c) * 100 if mean_c != 0 else float('inf')

            flag = "⚠️" if cv > 50 else ""
            print(f"{name:<20} {mean_c:>12.2f} {std_c:>12.2f} {cv:>10.1f}% {flag}")

    # =========================================
    # 5. Recommendations
    # =========================================
    print("\n5. RECOMMENDATIONS")
    print("-"*70)

    if severe_vif:
        print(f"⚠️  Severe multicollinearity detected in: {', '.join(severe_vif)}")
        print("\nSuggested actions:")
        print("  1. Remove redundant features (keep most interpretable)")
        print("  2. Combine correlated features using PCA")
        print("  3. Use Ridge or Lasso regression")
        print("  4. If prediction is the goal, multicollinearity may be OK")
    else:
        print("✓ No severe multicollinearity detected")

    return {
        'correlation_matrix': corr_matrix,
        'high_correlations': high_corr,
        'condition_number': cond_num
    }

# Run full diagnostic
results = full_multicollinearity_check(X, ['Sqft', 'Rooms', 'Bathrooms'], price)

Quick Reference

Detection Methods

Method                   How                                   Threshold        Best For
-----------------------  ------------------------------------  ---------------  -----------------
Correlation Matrix       Check pairwise correlations           |r| > 0.7        Quick first look
VIF                      Regress each feature on the others    VIF > 10         Gold standard
Condition Number         Condition number of feature matrix    > 100            Overall health
Coefficient Stability    Bootstrap the coefficients            High variance    Practical impact

Fixes

Fix                  When to Use                 Pros                       Cons
-------------------  --------------------------  -------------------------  ------------------------
Remove features      Clear redundancy            Simple, interpretable      Lose some information
PCA                  Many correlated features    Captures all variance      Less interpretable
Ridge                Need all features           Stabilizes coefficients    Doesn't select features
Lasso                Want automatic selection    Selects features           May be too aggressive
Domain knowledge     Have expertise              Best interpretability      Requires expertise

Key Takeaways

  1. Multicollinearity = features telling the same story — Like three witnesses repeating one observation

  2. High correlation ≠ always bad — Only a problem for interpretation and inference

  3. VIF > 10 is severe — More than 90% of that feature's variance can be predicted from the other features (R² > 0.9)

  4. Predictions may be fine! — Multicollinearity breaks interpretation, not necessarily predictions (see the sketch after this list)

  5. Coefficients become unstable — Small data changes → huge coefficient changes

  6. Signs can flip — "More bedrooms decreases price" is a red flag

  7. Ridge/Lasso help — Regularization stabilizes coefficients

  8. Sometimes removing features is best — The simple solution often works
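Takeaway 4 is easy to check on the synthetic data from Problem 3. A quick sketch (my addition, not part of the walkthrough above): hold out a test set and compare the full collinear model against the sqft-only model, and their test R² scores land in roughly the same place.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X_full = np.column_stack([sqft, rooms, bathrooms])
X_sqft_only = sqft.reshape(-1, 1)

# Same train/test split for both models so the comparison is fair
idx_train, idx_test = train_test_split(np.arange(len(price)), test_size=0.3, random_state=0)

full = LinearRegression().fit(X_full[idx_train], price[idx_train])
single = LinearRegression().fit(X_sqft_only[idx_train], price[idx_train])

print(f"Test R² (all three collinear features): {full.score(X_full[idx_test], price[idx_test]):.4f}")
print(f"Test R² (sqft only):                    {single.score(X_sqft_only[idx_test], price[idx_test]):.4f}")
# Predictions survive multicollinearity; it's the coefficient story that falls apart.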


The One-Sentence Summary

Three witnesses telling the same story doesn't give you three times the evidence — when your features are highly correlated, you THINK you have more information than you do, your coefficients fight over who gets credit, and you end up with nonsense like "adding a bathroom DECREASES home value" even though the math technically minimizes error.


What's Next?

Now that you understand multicollinearity, you're ready for:

  • Ridge Regression — L2 regularization to stabilize coefficients
  • Lasso Regression — L1 regularization for feature selection
  • Principal Component Regression — When you have too many correlated features
  • Elastic Net — The best of Ridge and Lasso

Follow me for the next article in this series!


Let's Connect!

If "more bedrooms decreases price" finally makes sense (as a bug, not a feature), drop a heart!

Questions? Ask in the comments — I read and respond to every one.

What's the worst multicollinearity you've seen? I once saw a model with "height in inches" AND "height in centimeters" as separate features. VIF was literally infinite! 📏


The difference between "I added more features and R² went up!" and "I added more features and now nothing makes sense"? Multicollinearity. More features isn't always better — especially when they're all saying the same thing.


Share this with someone who throws every feature into their model hoping for the best. They're about to learn why that doesn't work.

Happy debugging! 🔍
