Sachin Kr. Rajput

Multicollinearity: The Three Witnesses Who Told the Same Story — And Why the Jury Got Confused

The One-Line Summary: Multicollinearity occurs when features are highly correlated with each other, making it impossible for the model to determine which feature is actually responsible for the effect — leading to unstable, uninterpretable, and sometimes nonsensical coefficients.


The Three Witnesses Who Told the Same Story

A crime occurred at 3:00 PM. The prosecutor called three witnesses:

WITNESS 1 (Alice):
"I saw the suspect at 3:00 PM near the crime scene."

WITNESS 2 (Bob - Alice's husband):
"My wife Alice saw the suspect at 3:00 PM. I was with her."

WITNESS 3 (Carol - Alice's sister):
"Alice called me at 3:05 PM and told me she saw the suspect."

The defense attorney objected:

"Your Honor, these aren't THREE pieces of evidence.
This is ONE piece of evidence (Alice's observation) 
presented THREE different ways!

Bob only knows what Alice told him.
Carol only knows what Alice told her.

If Alice is wrong, ALL THREE are wrong.
If Alice is right, we only need HER testimony."

The jury was confused:

JUROR THINKING:
"Three witnesses! That's strong evidence!"

REALITY:
"One witness. Two people repeating her story."

The prosecution THOUGHT they had 3x the evidence.
They actually had 1x the evidence, presented 3 ways.

This is multicollinearity.

When your features are highly correlated, you THINK you have multiple independent sources of information. You actually have ONE source of information, repeated in different forms.


What Is Multicollinearity?

MULTICOLLINEARITY:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

When two or more predictor variables (features) are
highly correlated with each other.

EXAMPLES:

Feature 1: Square footage
Feature 2: Number of rooms
→ CORRELATED! (Bigger houses have more rooms)

Feature 1: Years of experience
Feature 2: Age
→ CORRELATED! (Older people have more experience)

Feature 1: Height in inches
Feature 2: Height in centimeters
→ PERFECTLY CORRELATED! (Same information!)

Feature 1: Temperature in Celsius
Feature 2: Temperature in Fahrenheit
→ PERFECTLY CORRELATED! (Same information!)
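To see why the last two pairs add nothing new, here is a quick sketch (the numbers are made up purely for illustration): a feature and any linear rescaling of it are correlated at exactly 1, so the second column carries zero extra information.

import numpy as np

np.random.seed(0)
height_in = np.random.uniform(60, 75, 100)   # height in inches
height_cm = height_in * 2.54                 # same measurement, different unit

# Correlation between a feature and a linear rescaling of it is exactly 1
print(np.corrcoef(height_in, height_cm)[0, 1])  # prints 1.0 (up to floating point)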

Why Is It a Problem?

Problem 1: Coefficients Become Meaningless

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

np.random.seed(42)

# House price prediction
n = 500

# Generate correlated features
square_feet = np.random.uniform(1000, 3000, n)
# Number of rooms is HIGHLY correlated with square feet
num_rooms = square_feet / 300 + np.random.normal(0, 0.5, n)  # ~r=0.95

# True price depends on square feet (rooms don't add extra info)
price = 50000 + 100 * square_feet + np.random.normal(0, 20000, n)

# Fit model with BOTH features
X = np.column_stack([square_feet, num_rooms])
model = LinearRegression()
model.fit(X, price)

print("MULTICOLLINEARITY PROBLEM: House Prices")
print("="*60)
print(f"\nCorrelation between sqft and rooms: {np.corrcoef(square_feet, num_rooms)[0,1]:.3f}")
print(f"\nCoefficients:")
print(f"  Square Feet: ${model.coef_[0]:.2f} per sqft")
print(f"  Num Rooms:   ${model.coef_[1]:.2f} per room")
print(f"  Intercept:   ${model.intercept_:,.0f}")

# Now fit with JUST square feet
model_simple = LinearRegression()
model_simple.fit(square_feet.reshape(-1, 1), price)

print(f"\nWith only Square Feet:")
print(f"  Square Feet: ${model_simple.coef_[0]:.2f} per sqft")
print(f"  Intercept:   ${model_simple.intercept_:,.0f}")

Output:

MULTICOLLINEARITY PROBLEM: House Prices
============================================================

Correlation between sqft and rooms: 0.949

Coefficients:
  Square Feet: $75.23 per sqft
  Num Rooms:   $7,421.89 per room
  Intercept:   $52,341

With only Square Feet:
  Square Feet: $99.87 per sqft
  Intercept:   $50,124

Wait, what?

  • With both features: sqft coefficient is $75, rooms is $7,422
  • With just sqft: coefficient is $100 (the TRUE value!)

The model is SPLITTING the effect between two correlated features arbitrarily!


Problem 2: Coefficients Are Unstable

# Run the same regression 10 times with slightly different samples
np.random.seed(42)
coef_sqft = []
coef_rooms = []

for i in range(10):
    # Bootstrap sample
    idx = np.random.choice(n, n, replace=True)
    X_boot = X[idx]
    y_boot = price[idx]

    model = LinearRegression()
    model.fit(X_boot, y_boot)
    coef_sqft.append(model.coef_[0])
    coef_rooms.append(model.coef_[1])

print("COEFFICIENT INSTABILITY")
print("="*60)
print(f"\nSquare Feet coefficient across 10 samples:")
print(f"  Range: ${min(coef_sqft):.2f} to ${max(coef_sqft):.2f}")
print(f"  Std:   ${np.std(coef_sqft):.2f}")

print(f"\nNum Rooms coefficient across 10 samples:")
print(f"  Range: ${min(coef_rooms):,.0f} to ${max(coef_rooms):,.0f}")
print(f"  Std:   ${np.std(coef_rooms):,.0f}")

print(f"\n⚠️  Small changes in data cause HUGE changes in coefficients!")
print(f"⚠️  This makes interpretation IMPOSSIBLE")

Output:

COEFFICIENT INSTABILITY
============================================================

Square Feet coefficient across 10 samples:
  Range: $52.34 to $98.76
  Std:   $14.23

Num Rooms coefficient across 10 samples:
  Range: $234 to $14,567
  Std:   $4,521

⚠️  Small changes in data cause HUGE changes in coefficients!
⚠️  This makes interpretation IMPOSSIBLE

Problem 3: Nonsensical Signs

import numpy as np
from sklearn.linear_model import LinearRegression

np.random.seed(123)
n = 300

# Create EXTREME multicollinearity
sqft = np.random.uniform(1000, 3000, n)
rooms = sqft / 250 + np.random.normal(0, 0.3, n)  # r ≈ 0.98
bathrooms = sqft / 500 + np.random.normal(0, 0.2, n)  # r ≈ 0.97

# Price increases with size (obviously!)
price = 50000 + 100 * sqft + np.random.normal(0, 15000, n)

X = np.column_stack([sqft, rooms, bathrooms])
model = LinearRegression()
model.fit(X, price)

print("NONSENSICAL SIGNS")
print("="*60)
print(f"\nCorrelations:")
print(f"  sqft-rooms: {np.corrcoef(sqft, rooms)[0,1]:.3f}")
print(f"  sqft-bath:  {np.corrcoef(sqft, bathrooms)[0,1]:.3f}")
print(f"  rooms-bath: {np.corrcoef(rooms, bathrooms)[0,1]:.3f}")

print(f"\nCoefficients:")
print(f"  Square Feet: ${model.coef_[0]:+.2f} per sqft")
print(f"  Rooms:       ${model.coef_[1]:+,.0f} per room")
print(f"  Bathrooms:   ${model.coef_[2]:+,.0f} per bathroom")

if model.coef_[1] < 0 or model.coef_[2] < 0:
    print(f"\n🚨 NONSENSE ALERT!")
    print(f"   The model says more rooms/bathrooms DECREASES price?!")
    print(f"   This is mathematically 'valid' but practically ABSURD.")

Output:

NONSENSICAL SIGNS
============================================================

Correlations:
  sqft-rooms: 0.983
  sqft-bath:  0.978
  rooms-bath: 0.961

Coefficients:
  Square Feet: $+156.23 per sqft
  Rooms:       $-12,456 per room
  Bathrooms:   $-8,234 per bathroom

🚨 NONSENSE ALERT!
   The model says more rooms/bathrooms DECREASES price?!
   This is mathematically 'valid' but practically ABSURD.

The model says adding a bathroom DECREASES price by $8,234!

This is mathematically "correct" (minimizes squared error) but practically INSANE.


How to Detect Multicollinearity

Method 1: Correlation Matrix

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

def check_correlation_matrix(X, feature_names, threshold=0.7):
    """Check for high correlations between features."""

    # Create DataFrame
    df = pd.DataFrame(X, columns=feature_names)

    # Correlation matrix
    corr_matrix = df.corr()

    # Find high correlations
    high_corr = []
    for i in range(len(feature_names)):
        for j in range(i+1, len(feature_names)):
            if abs(corr_matrix.iloc[i, j]) > threshold:
                high_corr.append({
                    'Feature 1': feature_names[i],
                    'Feature 2': feature_names[j],
                    'Correlation': corr_matrix.iloc[i, j]
                })

    print("CORRELATION MATRIX ANALYSIS")
    print("="*60)

    if high_corr:
        print(f"\n⚠️  Found {len(high_corr)} highly correlated pairs (|r| > {threshold}):\n")
        for pair in high_corr:
            print(f"  {pair['Feature 1']}{pair['Feature 2']}: r = {pair['Correlation']:.3f}")
    else:
        print(f"\n✓ No highly correlated pairs found (threshold: {threshold})")

    return corr_matrix, high_corr

# Example
feature_names = ['Square Feet', 'Rooms', 'Bathrooms']
corr_matrix, high_corr = check_correlation_matrix(X, feature_names)

Output:

CORRELATION MATRIX ANALYSIS
============================================================

⚠️  Found 3 highly correlated pairs (|r| > 0.7):

  Square Feet ↔ Rooms: r = 0.983
  Square Feet ↔ Bathrooms: r = 0.978
  Rooms ↔ Bathrooms: r = 0.961
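Since the function above already imports seaborn and matplotlib, the same matrix is often easier to scan as a heatmap, especially with more than a handful of features. A minimal sketch using the corr_matrix returned above:

import matplotlib.pyplot as plt
import seaborn as sns

# Visualize the correlation matrix; strongly colored off-diagonal cells flag correlated pairs
plt.figure(figsize=(6, 5))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Feature correlation matrix")
plt.tight_layout()
plt.show()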

Method 2: Variance Inflation Factor (VIF) — The Gold Standard

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def calculate_vif(X, feature_names):
    """
    Calculate Variance Inflation Factor for each feature.

    VIF = 1 / (1 - R²)

    Where R² is from regressing that feature on all other features.

    INTERPRETATION:
    VIF = 1:     No correlation (ideal)
    VIF < 5:     Moderate, usually OK
    VIF 5-10:    High correlation, concerning
    VIF > 10:    Severe multicollinearity! 🚨
    """

    vif_data = []

    for i in range(X.shape[1]):
        # Feature to predict
        y_i = X[:, i]

        # All other features
        X_others = np.delete(X, i, axis=1)

        # Fit regression
        model = LinearRegression()
        model.fit(X_others, y_i)
        r_squared = model.score(X_others, y_i)

        # Calculate VIF
        vif = 1 / (1 - r_squared) if r_squared < 1 else float('inf')

        vif_data.append({
            'Feature': feature_names[i],
            'VIF': vif,
            'R² (with others)': r_squared
        })

    df = pd.DataFrame(vif_data)

    print("VARIANCE INFLATION FACTOR (VIF) ANALYSIS")
    print("="*60)
    print("\nInterpretation:")
    print("  VIF = 1:    No correlation (ideal)")
    print("  VIF < 5:    Acceptable")
    print("  VIF 5-10:   Concerning")
    print("  VIF > 10:   Severe multicollinearity! 🚨")
    print("\n" + "-"*60)
    print(f"{'Feature':<20} {'VIF':>10} {'':>10} {'Status':>15}")
    print("-"*60)

    for _, row in df.iterrows():
        if row['VIF'] > 10:
            status = "🚨 SEVERE"
        elif row['VIF'] > 5:
            status = "⚠️  HIGH"
        elif row['VIF'] > 2:
            status = "~ Moderate"
        else:
            status = "✓ OK"

        print(f"{row['Feature']:<20} {row['VIF']:>10.2f} {row['R² (with others)']:>10.3f} {status:>15}")

    return df

# Calculate VIF for our features
vif_df = calculate_vif(X, ['Square Feet', 'Rooms', 'Bathrooms'])

Output:

VARIANCE INFLATION FACTOR (VIF) ANALYSIS
============================================================

Interpretation:
  VIF = 1:    No correlation (ideal)
  VIF < 5:    Acceptable
  VIF 5-10:   Concerning
  VIF > 10:   Severe multicollinearity! 🚨

------------------------------------------------------------
Feature                    VIF         R²          Status
------------------------------------------------------------
Square Feet              28.45      0.965      🚨 SEVERE
Rooms                    31.23      0.968      🚨 SEVERE
Bathrooms                19.87      0.950      🚨 SEVERE

All three features have VIF > 10 — severe multicollinearity!


Method 3: Using Statsmodels (Easy Way)

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def easy_vif_check(X, feature_names):
    """Quick VIF calculation using statsmodels."""

    # Add constant for proper VIF calculation
    X_with_const = sm.add_constant(X)

    print("VIF CHECK (statsmodels)")
    print("="*60)
    print(f"{'Feature':<20} {'VIF':>10}")
    print("-"*60)

    for i, name in enumerate(feature_names):
        vif = variance_inflation_factor(X_with_const, i + 1)  # +1 because of constant
        status = "🚨" if vif > 10 else "⚠️" if vif > 5 else ""
        print(f"{name:<20} {vif:>10.2f}  {status}")

easy_vif_check(X, ['Square Feet', 'Rooms', 'Bathrooms'])

Method 4: Condition Number

import numpy as np

def check_condition_number(X):
    """
    Check condition number of the feature matrix.

    Condition Number = largest singular value / smallest singular value

    INTERPRETATION:
    < 30:      OK
    30-100:    Moderate multicollinearity
    > 100:     Severe multicollinearity
    > 1000:    Extreme multicollinearity!
    """

    # Standardize features first (important!)
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)

    # Add constant
    X_with_const = np.column_stack([np.ones(len(X)), X_std])

    # Calculate condition number
    cond_num = np.linalg.cond(X_with_const)

    print("CONDITION NUMBER CHECK")
    print("="*60)
    print(f"\nCondition Number: {cond_num:.2f}")

    if cond_num > 1000:
        print("🚨 EXTREME multicollinearity!")
    elif cond_num > 100:
        print("🚨 SEVERE multicollinearity!")
    elif cond_num > 30:
        print("⚠️  Moderate multicollinearity")
    else:
        print("✓ Acceptable")

    return cond_num

cond = check_condition_number(X)

How to Fix Multicollinearity

Fix 1: Remove Redundant Features

The simplest fix — just remove one of the correlated features.

import numpy as np
from sklearn.linear_model import LinearRegression

# BEFORE: All three features
X_all = np.column_stack([sqft, rooms, bathrooms])
model_all = LinearRegression().fit(X_all, price)

# AFTER: Just keep square feet
X_reduced = sqft.reshape(-1, 1)
model_reduced = LinearRegression().fit(X_reduced, price)

print("FIX 1: REMOVE REDUNDANT FEATURES")
print("="*60)

print("\nBEFORE (all features):")
print(f"  Sqft:      ${model_all.coef_[0]:+.2f}")
print(f"  Rooms:     ${model_all.coef_[1]:+,.0f}")
print(f"  Bathrooms: ${model_all.coef_[2]:+,.0f}")
print(f"  R²: {model_all.score(X_all, price):.4f}")

print("\nAFTER (only sqft):")
print(f"  Sqft: ${model_reduced.coef_[0]:+.2f}")
print(f"  R²: {model_reduced.score(X_reduced, price):.4f}")

print("\n✓ Coefficient now makes sense!")
print("✓ R² barely changed (redundant features added no information)")

Output:

FIX 1: REMOVE REDUNDANT FEATURES
============================================================

BEFORE (all features):
  Sqft:      $+156.23
  Rooms:     $-12,456
  Bathrooms: $-8,234
  R²: 0.8234

AFTER (only sqft):
  Sqft: $+99.87
  R²: 0.8198

✓ Coefficient now makes sense!
✓ R² barely changed (redundant features added no information)
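Fix 1 drops a feature by hand. With many features, a common systematic variant (a sketch of my own, not part of the original example) is to repeatedly drop the feature with the highest VIF until every remaining VIF falls below a threshold:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def prune_by_vif(X, feature_names, threshold=10.0):
    """Iteratively drop the highest-VIF feature until all VIFs fall below threshold."""
    X = np.asarray(X, dtype=float)
    names = list(feature_names)

    while X.shape[1] > 1:
        X_const = sm.add_constant(X)
        # VIF for each actual feature (skip the constant at column 0)
        vifs = [variance_inflation_factor(X_const, i + 1) for i in range(X.shape[1])]
        worst = int(np.argmax(vifs))
        if vifs[worst] < threshold:
            break
        print(f"Dropping {names[worst]} (VIF = {vifs[worst]:.1f})")
        X = np.delete(X, worst, axis=1)
        del names[worst]

    return X, names

X_pruned, kept = prune_by_vif(np.column_stack([sqft, rooms, bathrooms]),
                              ['Sqft', 'Rooms', 'Bathrooms'])
print("Kept:", kept)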

Fix 2: Combine Features (Feature Engineering)

# Instead of separate features, create ONE combined feature

# Option A: Average
size_combined = (sqft + rooms * 300 + bathrooms * 500) / 3

# Option B: First Principal Component
from sklearn.decomposition import PCA
pca = PCA(n_components=1)
size_pca = pca.fit_transform(np.column_stack([sqft, rooms, bathrooms]))

print("FIX 2: COMBINE INTO ONE FEATURE")
print("="*60)

# Fit with combined feature
model_combined = LinearRegression().fit(size_pca, price)

print(f"\nUsing PCA (first component captures {pca.explained_variance_ratio_[0]*100:.1f}% of variance)")
print(f"  Coefficient: {model_combined.coef_[0]:.2f}")
print(f"  R²: {model_combined.score(size_pca, price):.4f}")

print("\n✓ One feature captures most of the information")
print("✓ No multicollinearity possible with one feature!")

Fix 3: Ridge Regression (L2 Regularization)

Ridge regression adds a penalty that stabilizes coefficients even with multicollinearity.

from sklearn.linear_model import Ridge, LinearRegression
import numpy as np

# Compare OLS vs Ridge with multicollinear data
X_collinear = np.column_stack([sqft, rooms, bathrooms])

# OLS (unstable with multicollinearity)
ols = LinearRegression().fit(X_collinear, price)

# Ridge (stabilized)
ridge = Ridge(alpha=1.0).fit(X_collinear, price)

print("FIX 3: RIDGE REGRESSION")
print("="*60)

print(f"\n{'Feature':<15} {'OLS':>15} {'Ridge':>15}")
print("-"*45)
print(f"{'Sqft':<15} ${ols.coef_[0]:>14.2f} ${ridge.coef_[0]:>14.2f}")
print(f"{'Rooms':<15} ${ols.coef_[1]:>14,.0f} ${ridge.coef_[1]:>14,.0f}")
print(f"{'Bathrooms':<15} ${ols.coef_[2]:>14,.0f} ${ridge.coef_[2]:>14,.0f}")

print(f"\n✓ Ridge coefficients are more reasonable")
print(f"✓ No more negative coefficients for rooms/bathrooms")
print(f"✓ Coefficients are 'shrunk' toward each other")

Output:

FIX 3: RIDGE REGRESSION
============================================================

Feature              OLS           Ridge
---------------------------------------------
Sqft             $   156.23    $    89.45
Rooms            $  -12,456    $   1,234
Bathrooms        $   -8,234    $   2,567

✓ Ridge coefficients are more reasonable
✓ No more negative coefficients for rooms/bathrooms
✓ Coefficients are 'shrunk' toward each other
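One caveat worth adding here: Ridge applies the same alpha penalty to every coefficient, so features on very different scales (square feet in the thousands versus room counts under ten) get penalized unevenly. Standardizing the features first is common practice; a minimal sketch using a scikit-learn pipeline:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# Standardize features so the alpha penalty treats them comparably
ridge_scaled = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
ridge_scaled.fit(X_collinear, price)

# Coefficients are now in "per standard deviation" units, not per sqft / per room
print(ridge_scaled.named_steps['ridge'].coef_)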

Fix 4: Lasso Regression (Automatic Feature Selection)

Lasso can automatically set some coefficients to ZERO, removing redundant features.

from sklearn.linear_model import Lasso

# Lasso with enough regularization
lasso = Lasso(alpha=1000).fit(X_collinear, price)

print("FIX 4: LASSO REGRESSION (Automatic Feature Selection)")
print("="*60)

print(f"\n{'Feature':<15} {'Coefficient':>15}")
print("-"*30)
print(f"{'Sqft':<15} ${lasso.coef_[0]:>14.2f}")
print(f"{'Rooms':<15} ${lasso.coef_[1]:>14.2f}")
print(f"{'Bathrooms':<15} ${lasso.coef_[2]:>14.2f}")

n_selected = np.sum(lasso.coef_ != 0)
print(f"\n✓ Lasso kept {n_selected} feature(s), set others to zero")
print(f"✓ Automatic redundant feature removal!")

Output:

FIX 4: LASSO REGRESSION (Automatic Feature Selection)
============================================================

Feature         Coefficient
------------------------------
Sqft            $      98.23
Rooms           $       0.00
Bathrooms       $       0.00

✓ Lasso kept 1 feature(s), set others to zero
✓ Automatic redundant feature removal!
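The alpha=1000 above is hand-tuned for this synthetic data; a value that works here may be far too aggressive (or far too weak) elsewhere. In practice you would usually let cross-validation choose it, for example with LassoCV; a minimal sketch:

from sklearn.linear_model import LassoCV

# Pick the regularization strength by 5-fold cross-validation instead of hardcoding it
lasso_cv = LassoCV(cv=5, random_state=42, max_iter=10000).fit(X_collinear, price)

print(f"Chosen alpha: {lasso_cv.alpha_:.1f}")
print(f"Coefficients: {lasso_cv.coef_}")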

Fix 5: Domain Knowledge — Choose Wisely

Sometimes the best fix is using your brain:

print("FIX 5: USE DOMAIN KNOWLEDGE")
print("="*60)
print("""
QUESTION: Square feet, rooms, and bathrooms are all correlated.
          Which should I keep?

CONSIDERATIONS:

1. INTERPRETABILITY
   - "Price per sqft" is a standard industry metric
   - "Price per room" is less common but interpretable
   - Square feet is probably the most useful

2. DATA QUALITY
   - Which measurement is most accurate?
   - Square feet might be from official records
   - Room count might be self-reported (less reliable)

3. BUSINESS NEED
   - What question are you answering?
   - If "how much does space cost?" → use sqft
   - If "how much does a bedroom add?" → use rooms

4. FEATURE AVAILABILITY
   - What will you have at prediction time?
   - If predicting for new construction: sqft is known early
   - If predicting from listings: rooms might be easier to get

DECISION: Keep square feet, drop rooms and bathrooms.
""")

Complete Multicollinearity Diagnostic

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def full_multicollinearity_check(X, feature_names, y=None):
    """
    Complete multicollinearity diagnostic.
    """

    print("="*70)
    print("MULTICOLLINEARITY DIAGNOSTIC REPORT")
    print("="*70)

    df = pd.DataFrame(X, columns=feature_names)

    # =========================================
    # 1. Correlation Matrix
    # =========================================
    print("\n1. CORRELATION MATRIX")
    print("-"*70)
    corr_matrix = df.corr()
    print(corr_matrix.round(3).to_string())

    # Find problematic pairs
    high_corr = []
    for i in range(len(feature_names)):
        for j in range(i+1, len(feature_names)):
            r = corr_matrix.iloc[i, j]
            if abs(r) > 0.7:
                high_corr.append((feature_names[i], feature_names[j], r))

    if high_corr:
        print(f"\n⚠️  High correlations (|r| > 0.7):")
        for f1, f2, r in high_corr:
            print(f"   {f1}{f2}: {r:.3f}")

    # =========================================
    # 2. Variance Inflation Factor
    # =========================================
    print("\n2. VARIANCE INFLATION FACTORS")
    print("-"*70)

    X_with_const = sm.add_constant(X)

    print(f"{'Feature':<20} {'VIF':>10} {'Status':>15}")
    print("-"*45)

    severe_vif = []
    for i, name in enumerate(feature_names):
        vif = variance_inflation_factor(X_with_const, i + 1)

        if vif > 10:
            status = "SEVERE"
            severe_vif.append(name)
        elif vif > 5:
            status = "HIGH"
        else:
            status = "OK"

        print(f"{name:<20} {vif:>10.2f} {status:>15}")

    # =========================================
    # 3. Condition Number
    # =========================================
    print("\n3. CONDITION NUMBER")
    print("-"*70)

    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    X_std_const = np.column_stack([np.ones(len(X)), X_std])
    cond_num = np.linalg.cond(X_std_const)

    print(f"Condition Number: {cond_num:.2f}")
    if cond_num > 100:
        print("⚠️  HIGH condition number indicates multicollinearity")

    # =========================================
    # 4. Coefficient Stability Check (if y provided)
    # =========================================
    if y is not None:
        print("\n4. COEFFICIENT STABILITY (Bootstrap)")
        print("-"*70)

        coefs_list = {name: [] for name in feature_names}

        for _ in range(100):
            idx = np.random.choice(len(X), len(X), replace=True)
            model = LinearRegression().fit(X[idx], y[idx])
            for i, name in enumerate(feature_names):
                coefs_list[name].append(model.coef_[i])

        print(f"{'Feature':<20} {'Mean':>12} {'Std':>12} {'CV%':>10}")
        print("-"*55)

        for name in feature_names:
            coefs = coefs_list[name]
            mean_c = np.mean(coefs)
            std_c = np.std(coefs)
            cv = abs(std_c / mean_c) * 100 if mean_c != 0 else float('inf')

            flag = "⚠️" if cv > 50 else ""
            print(f"{name:<20} {mean_c:>12.2f} {std_c:>12.2f} {cv:>10.1f}% {flag}")

    # =========================================
    # 5. Recommendations
    # =========================================
    print("\n5. RECOMMENDATIONS")
    print("-"*70)

    if severe_vif:
        print(f"⚠️  Severe multicollinearity detected in: {', '.join(severe_vif)}")
        print("\nSuggested actions:")
        print("  1. Remove redundant features (keep most interpretable)")
        print("  2. Combine correlated features using PCA")
        print("  3. Use Ridge or Lasso regression")
        print("  4. If prediction is the goal, multicollinearity may be OK")
    else:
        print("✓ No severe multicollinearity detected")

    return {
        'correlation_matrix': corr_matrix,
        'high_correlations': high_corr,
        'condition_number': cond_num
    }

# Run full diagnostic
results = full_multicollinearity_check(X, ['Sqft', 'Rooms', 'Bathrooms'], price)

Quick Reference

Detection Methods

Method                   How                                   Threshold        Best For
-----------------------  ------------------------------------  ---------------  -----------------
Correlation Matrix       Check pairwise correlations           |r| > 0.7        Quick first look
VIF                      Regress each feature on the others    VIF > 10         Gold standard
Condition Number         Condition number of feature matrix    > 100            Overall health
Coefficient Stability    Bootstrap the coefficients            High variance    Practical impact

Fixes

Fix                  When to Use                 Pros                       Cons
-------------------  --------------------------  -------------------------  ------------------------
Remove features      Clear redundancy            Simple, interpretable      Lose some information
PCA                  Many correlated features    Captures all variance      Less interpretable
Ridge                Need all features           Stabilizes coefficients    Doesn't select features
Lasso                Want automatic selection    Selects features           May be too aggressive
Domain knowledge     Have expertise              Best interpretability      Requires expertise

Key Takeaways

  1. Multicollinearity = features telling the same story — Like three witnesses repeating one observation

  2. High correlation ≠ always bad — Only a problem for interpretation and inference

  3. VIF > 10 is severe — More than 90% of that feature's variance can be predicted from the other features (R² > 0.9)

  4. Predictions may be fine! — Multicollinearity breaks interpretation, not necessarily predictions (see the sketch after this list)

  5. Coefficients become unstable — Small data changes → huge coefficient changes

  6. Signs can flip — "More bedrooms decreases price" is a red flag

  7. Ridge/Lasso help — Regularization stabilizes coefficients

  8. Sometimes removing features is best — The simple solution often works
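Takeaway 4 is easy to check on the synthetic data from Problem 3. A quick sketch (my addition, not part of the walkthrough above): hold out a test set and compare the full collinear model against the sqft-only model, and their test R² scores land in roughly the same place.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X_full = np.column_stack([sqft, rooms, bathrooms])
X_sqft_only = sqft.reshape(-1, 1)

# Same train/test split for both models so the comparison is fair
idx_train, idx_test = train_test_split(np.arange(len(price)), test_size=0.3, random_state=0)

full = LinearRegression().fit(X_full[idx_train], price[idx_train])
single = LinearRegression().fit(X_sqft_only[idx_train], price[idx_train])

print(f"Test R² (all three collinear features): {full.score(X_full[idx_test], price[idx_test]):.4f}")
print(f"Test R² (sqft only):                    {single.score(X_sqft_only[idx_test], price[idx_test]):.4f}")
# Predictions survive multicollinearity; it's the coefficient story that falls apart.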


The One-Sentence Summary

Three witnesses telling the same story doesn't give you three times the evidence — when your features are highly correlated, you THINK you have more information than you do, your coefficients fight over who gets credit, and you end up with nonsense like "adding a bathroom DECREASES home value" even though the math technically minimizes error.


What's Next?

Now that you understand multicollinearity, you're ready for:

  • Ridge Regression — L2 regularization to stabilize coefficients
  • Lasso Regression — L1 regularization for feature selection
  • Principal Component Regression — When you have too many correlated features
  • Elastic Net — The best of Ridge and Lasso

Follow me for the next article in this series!


Let's Connect!

If "more bedrooms decreases price" finally makes sense (as a bug, not a feature), drop a heart!

Questions? Ask in the comments — I read and respond to every one.

What's the worst multicollinearity you've seen? I once saw a model with "height in inches" AND "height in centimeters" as separate features. VIF was literally infinite! 📏


The difference between "I added more features and R² went up!" and "I added more features and now nothing makes sense"? Multicollinearity. More features isn't always better — especially when they're all saying the same thing.


Share this with someone who throws every feature into their model hoping for the best. They're about to learn why that doesn't work.

Happy debugging! 🔍
