The One-Line Summary: Multicollinearity occurs when features are highly correlated with each other, making it impossible for the model to determine which feature is actually responsible for the effect — leading to unstable, uninterpretable, and sometimes nonsensical coefficients.
The Three Witnesses Who Told the Same Story
A crime occurred at 3:00 PM. The prosecutor called three witnesses:
WITNESS 1 (Alice):
"I saw the suspect at 3:00 PM near the crime scene."
WITNESS 2 (Bob - Alice's husband):
"My wife Alice saw the suspect at 3:00 PM. I was with her."
WITNESS 3 (Carol - Alice's sister):
"Alice called me at 3:05 PM and told me she saw the suspect."
The defense attorney objected:
"Your Honor, these aren't THREE pieces of evidence.
This is ONE piece of evidence (Alice's observation)
presented THREE different ways!
Bob only knows what Alice told him.
Carol only knows what Alice told her.
If Alice is wrong, ALL THREE are wrong.
If Alice is right, we only need HER testimony."
The jury was confused:
JUROR THINKING:
"Three witnesses! That's strong evidence!"
REALITY:
"One witness. Two people repeating her story."
The prosecution THOUGHT they had 3x the evidence.
They actually had 1x the evidence, presented 3 ways.
This is multicollinearity.
When your features are highly correlated, you THINK you have multiple independent sources of information. You actually have ONE source of information, repeated in different forms.
What Is Multicollinearity?
MULTICOLLINEARITY:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
When two or more predictor variables (features) are
highly correlated with each other.
EXAMPLES:
Feature 1: Square footage
Feature 2: Number of rooms
→ CORRELATED! (Bigger houses have more rooms)
Feature 1: Years of experience
Feature 2: Age
→ CORRELATED! (Older people have more experience)
Feature 1: Height in inches
Feature 2: Height in centimeters
→ PERFECTLY CORRELATED! (Same information!)
Feature 1: Temperature in Celsius
Feature 2: Temperature in Fahrenheit
→ PERFECTLY CORRELATED! (Same information!)
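To see what "perfectly correlated" means in practice, here's a tiny sketch (with made-up heights) showing that a unit conversion carries exactly the same information as the original feature:
import numpy as np
# Two unit conversions of the same measurement: the second column
# is an exact linear transform of the first, so it adds no new information.
height_in = np.array([60.0, 65.0, 70.0, 75.0])
height_cm = height_in * 2.54
print(f"Correlation: {np.corrcoef(height_in, height_cm)[0, 1]:.3f}")  # exactly 1.000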
Why Is It a Problem?
Problem 1: Coefficients Become Meaningless
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
np.random.seed(42)
# House price prediction
n = 500
# Generate correlated features
square_feet = np.random.uniform(1000, 3000, n)
# Number of rooms is HIGHLY correlated with square feet
num_rooms = square_feet / 300 + np.random.normal(0, 0.5, n) # ~r=0.95
# True price depends on square feet (rooms don't add extra info)
price = 50000 + 100 * square_feet + np.random.normal(0, 20000, n)
# Fit model with BOTH features
X = np.column_stack([square_feet, num_rooms])
model = LinearRegression()
model.fit(X, price)
print("MULTICOLLINEARITY PROBLEM: House Prices")
print("="*60)
print(f"\nCorrelation between sqft and rooms: {np.corrcoef(square_feet, num_rooms)[0,1]:.3f}")
print(f"\nCoefficients:")
print(f" Square Feet: ${model.coef_[0]:.2f} per sqft")
print(f" Num Rooms: ${model.coef_[1]:.2f} per room")
print(f" Intercept: ${model.intercept_:,.0f}")
# Now fit with JUST square feet
model_simple = LinearRegression()
model_simple.fit(square_feet.reshape(-1, 1), price)
print(f"\nWith only Square Feet:")
print(f" Square Feet: ${model_simple.coef_[0]:.2f} per sqft")
print(f" Intercept: ${model_simple.intercept_:,.0f}")
Output:
MULTICOLLINEARITY PROBLEM: House Prices
============================================================
Correlation between sqft and rooms: 0.949
Coefficients:
Square Feet: $75.23 per sqft
Num Rooms: $7,421.89 per room
Intercept: $52,341
With only Square Feet:
Square Feet: $99.87 per sqft
Intercept: $50,124
Wait, what?
- With both features: sqft coefficient is $75, rooms is $7,422
- With just sqft: coefficient is $100 (the TRUE value!)
The model is SPLITTING the effect between two correlated features arbitrarily!
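A quick sanity check, using the coefficients printed above and the fact that rooms was generated as roughly sqft / 300: each extra square foot also adds about 1/300 of a room, so the two coefficients together still imply roughly the true $100 per square foot. The model didn't lose the signal; it just split the credit.
# Each extra sqft also adds ~1/300 of a room (by construction),
# so the combined marginal effect per square foot is:
implied_per_sqft = 75.23 + 7421.89 / 300  # coefficients from the output above
print(f"Implied effect per sqft: ${implied_per_sqft:.2f}")  # ~$99.97, close to the true $100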
Problem 2: Coefficients Are Unstable
# Run the same regression 10 times with slightly different samples
np.random.seed(42)
coef_sqft = []
coef_rooms = []
for i in range(10):
# Bootstrap sample
idx = np.random.choice(n, n, replace=True)
X_boot = X[idx]
y_boot = price[idx]
model = LinearRegression()
model.fit(X_boot, y_boot)
coef_sqft.append(model.coef_[0])
coef_rooms.append(model.coef_[1])
print("COEFFICIENT INSTABILITY")
print("="*60)
print(f"\nSquare Feet coefficient across 10 samples:")
print(f" Range: ${min(coef_sqft):.2f} to ${max(coef_sqft):.2f}")
print(f" Std: ${np.std(coef_sqft):.2f}")
print(f"\nNum Rooms coefficient across 10 samples:")
print(f" Range: ${min(coef_rooms):,.0f} to ${max(coef_rooms):,.0f}")
print(f" Std: ${np.std(coef_rooms):,.0f}")
print(f"\n⚠️ Small changes in data cause HUGE changes in coefficients!")
print(f"⚠️ This makes interpretation IMPOSSIBLE")
Output:
COEFFICIENT INSTABILITY
============================================================
Square Feet coefficient across 10 samples:
Range: $52.34 to $98.76
Std: $14.23
Num Rooms coefficient across 10 samples:
Range: $234 to $14,567
Std: $4,521
⚠️ Small changes in data cause HUGE changes in coefficients!
⚠️ This makes interpretation IMPOSSIBLE
Problem 3: Nonsensical Signs
import numpy as np
from sklearn.linear_model import LinearRegression
np.random.seed(123)
n = 300
# Create EXTREME multicollinearity
sqft = np.random.uniform(1000, 3000, n)
rooms = sqft / 250 + np.random.normal(0, 0.3, n) # r ≈ 0.98
bathrooms = sqft / 500 + np.random.normal(0, 0.2, n) # r ≈ 0.97
# Price increases with size (obviously!)
price = 50000 + 100 * sqft + np.random.normal(0, 15000, n)
X = np.column_stack([sqft, rooms, bathrooms])
model = LinearRegression()
model.fit(X, price)
print("NONSENSICAL SIGNS")
print("="*60)
print(f"\nCorrelations:")
print(f" sqft-rooms: {np.corrcoef(sqft, rooms)[0,1]:.3f}")
print(f" sqft-bath: {np.corrcoef(sqft, bathrooms)[0,1]:.3f}")
print(f" rooms-bath: {np.corrcoef(rooms, bathrooms)[0,1]:.3f}")
print(f"\nCoefficients:")
print(f" Square Feet: ${model.coef_[0]:+.2f} per sqft")
print(f" Rooms: ${model.coef_[1]:+,.0f} per room")
print(f" Bathrooms: ${model.coef_[2]:+,.0f} per bathroom")
if model.coef_[1] < 0 or model.coef_[2] < 0:
print(f"\n🚨 NONSENSE ALERT!")
print(f" The model says more rooms/bathrooms DECREASES price?!")
print(f" This is mathematically 'valid' but practically ABSURD.")
Output:
NONSENSICAL SIGNS
============================================================
Correlations:
sqft-rooms: 0.983
sqft-bath: 0.978
rooms-bath: 0.961
Coefficients:
Square Feet: $+156.23 per sqft
Rooms: $-12,456 per room
Bathrooms: $-8,234 per bathroom
🚨 NONSENSE ALERT!
The model says more rooms/bathrooms DECREASES price?!
This is mathematically 'valid' but practically ABSURD.
Taken literally, the model claims that adding a bathroom knocks $8,234 off the price. The fit is mathematically "correct" (it minimizes squared error), but the negative signs are an artifact of three features fighting over credit for the same underlying signal: size.
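The consolation prize: even with nonsensical coefficients, the predictions themselves are barely affected, because the correlated features jointly carry the same signal. A minimal sketch, reusing sqft, rooms, bathrooms, and price from the snippet above, comparing the three-feature model against a sqft-only model:
import numpy as np
from sklearn.linear_model import LinearRegression
# Fit both models on the same data generated above.
X_full = np.column_stack([sqft, rooms, bathrooms])
model_full = LinearRegression().fit(X_full, price)
model_sqft = LinearRegression().fit(sqft.reshape(-1, 1), price)
# The predictions should be very close, even though the full model's
# coefficients look absurd: interpretation breaks, prediction mostly doesn't.
diff = np.abs(model_full.predict(X_full) - model_sqft.predict(sqft.reshape(-1, 1)))
print(f"Mean difference in predicted price: ${diff.mean():,.0f}")
print(f"Max difference in predicted price: ${diff.max():,.0f}")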
How to Detect Multicollinearity
Method 1: Correlation Matrix
import numpy as np
import pandas as pd
def check_correlation_matrix(X, feature_names, threshold=0.7):
"""Check for high correlations between features."""
# Create DataFrame
df = pd.DataFrame(X, columns=feature_names)
# Correlation matrix
corr_matrix = df.corr()
# Find high correlations
high_corr = []
for i in range(len(feature_names)):
for j in range(i+1, len(feature_names)):
if abs(corr_matrix.iloc[i, j]) > threshold:
high_corr.append({
'Feature 1': feature_names[i],
'Feature 2': feature_names[j],
'Correlation': corr_matrix.iloc[i, j]
})
print("CORRELATION MATRIX ANALYSIS")
print("="*60)
if high_corr:
print(f"\n⚠️ Found {len(high_corr)} highly correlated pairs (|r| > {threshold}):\n")
for pair in high_corr:
print(f" {pair['Feature 1']} ↔ {pair['Feature 2']}: r = {pair['Correlation']:.3f}")
else:
print(f"\n✓ No highly correlated pairs found (threshold: {threshold})")
return corr_matrix, high_corr
# Example
feature_names = ['Square Feet', 'Rooms', 'Bathrooms']
corr_matrix, high_corr = check_correlation_matrix(X, feature_names)
Output:
CORRELATION MATRIX ANALYSIS
============================================================
⚠️ Found 3 highly correlated pairs (|r| > 0.7):
Square Feet ↔ Rooms: r = 0.983
Square Feet ↔ Bathrooms: r = 0.978
Rooms ↔ Bathrooms: r = 0.961
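If you prefer a visual check, plotting the same matrix as a heatmap makes correlated clusters jump out. A minimal sketch with seaborn, assuming the corr_matrix returned by the function above:
import seaborn as sns
import matplotlib.pyplot as plt
# Annotated heatmap of the pairwise correlations computed above.
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Feature Correlation Matrix")
plt.tight_layout()
plt.show()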
Method 2: Variance Inflation Factor (VIF) — The Gold Standard
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
def calculate_vif(X, feature_names):
"""
Calculate Variance Inflation Factor for each feature.
VIF = 1 / (1 - R²)
Where R² is from regressing that feature on all other features.
INTERPRETATION:
VIF = 1: No correlation (ideal)
VIF < 5: Moderate, usually OK
VIF 5-10: High correlation, concerning
VIF > 10: Severe multicollinearity! 🚨
"""
vif_data = []
for i in range(X.shape[1]):
# Feature to predict
y_i = X[:, i]
# All other features
X_others = np.delete(X, i, axis=1)
# Fit regression
model = LinearRegression()
model.fit(X_others, y_i)
r_squared = model.score(X_others, y_i)
# Calculate VIF
vif = 1 / (1 - r_squared) if r_squared < 1 else float('inf')
vif_data.append({
'Feature': feature_names[i],
'VIF': vif,
'R² (with others)': r_squared
})
df = pd.DataFrame(vif_data)
print("VARIANCE INFLATION FACTOR (VIF) ANALYSIS")
print("="*60)
print("\nInterpretation:")
print(" VIF = 1: No correlation (ideal)")
print(" VIF < 5: Acceptable")
print(" VIF 5-10: Concerning")
print(" VIF > 10: Severe multicollinearity! 🚨")
print("\n" + "-"*60)
print(f"{'Feature':<20} {'VIF':>10} {'R²':>10} {'Status':>15}")
print("-"*60)
for _, row in df.iterrows():
if row['VIF'] > 10:
status = "🚨 SEVERE"
elif row['VIF'] > 5:
status = "⚠️ HIGH"
elif row['VIF'] > 2:
status = "~ Moderate"
else:
status = "✓ OK"
print(f"{row['Feature']:<20} {row['VIF']:>10.2f} {row['R² (with others)']:>10.3f} {status:>15}")
return df
# Calculate VIF for our features
vif_df = calculate_vif(X, ['Square Feet', 'Rooms', 'Bathrooms'])
Output:
VARIANCE INFLATION FACTOR (VIF) ANALYSIS
============================================================
Interpretation:
VIF = 1: No correlation (ideal)
VIF < 5: Acceptable
VIF 5-10: Concerning
VIF > 10: Severe multicollinearity! 🚨
------------------------------------------------------------
Feature VIF R² Status
------------------------------------------------------------
Square Feet 28.45 0.965 🚨 SEVERE
Rooms 31.23 0.968 🚨 SEVERE
Bathrooms 19.87 0.950 🚨 SEVERE
All three features have VIF > 10 — severe multicollinearity!
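Another way to read these numbers: a coefficient's standard error is inflated by roughly the square root of its VIF compared to the no-correlation case, which is exactly what makes the estimates so wobbly. A quick calculation using the VIFs above:
import numpy as np
# Standard errors scale with sqrt(VIF), so confidence intervals for these
# coefficients are roughly 4-6x wider than they would be without correlation.
for name, vif in [("Square Feet", 28.45), ("Rooms", 31.23), ("Bathrooms", 19.87)]:
    print(f"{name:<12} VIF = {vif:>5.2f}  ->  std. error inflated ~{np.sqrt(vif):.1f}x")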
Method 3: Using Statsmodels (Easy Way)
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
def easy_vif_check(X, feature_names):
"""Quick VIF calculation using statsmodels."""
# Add constant for proper VIF calculation
X_with_const = sm.add_constant(X)
print("VIF CHECK (statsmodels)")
print("="*60)
print(f"{'Feature':<20} {'VIF':>10}")
print("-"*60)
for i, name in enumerate(feature_names):
vif = variance_inflation_factor(X_with_const, i + 1) # +1 because of constant
status = "🚨" if vif > 10 else "⚠️" if vif > 5 else "✓"
print(f"{name:<20} {vif:>10.2f} {status}")
easy_vif_check(X, ['Square Feet', 'Rooms', 'Bathrooms'])
Method 4: Condition Number
import numpy as np
def check_condition_number(X):
"""
Check condition number of the feature matrix.
Condition Number = largest singular value / smallest singular value
INTERPRETATION:
< 30: OK
30-100: Moderate multicollinearity
> 100: Severe multicollinearity
> 1000: Extreme multicollinearity!
"""
# Standardize features first (important!)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
# Add constant
X_with_const = np.column_stack([np.ones(len(X)), X_std])
# Calculate condition number
cond_num = np.linalg.cond(X_with_const)
print("CONDITION NUMBER CHECK")
print("="*60)
print(f"\nCondition Number: {cond_num:.2f}")
if cond_num > 1000:
print("🚨 EXTREME multicollinearity!")
elif cond_num > 100:
print("🚨 SEVERE multicollinearity!")
elif cond_num > 30:
print("⚠️ Moderate multicollinearity")
else:
print("✓ Acceptable")
return cond_num
cond = check_condition_number(X)
How to Fix Multicollinearity
Fix 1: Remove Redundant Features
The simplest fix — just remove one of the correlated features.
import numpy as np
from sklearn.linear_model import LinearRegression
# BEFORE: All three features
X_all = np.column_stack([sqft, rooms, bathrooms])
model_all = LinearRegression().fit(X_all, price)
# AFTER: Just keep square feet
X_reduced = sqft.reshape(-1, 1)
model_reduced = LinearRegression().fit(X_reduced, price)
print("FIX 1: REMOVE REDUNDANT FEATURES")
print("="*60)
print("\nBEFORE (all features):")
print(f" Sqft: ${model_all.coef_[0]:+.2f}")
print(f" Rooms: ${model_all.coef_[1]:+,.0f}")
print(f" Bathrooms: ${model_all.coef_[2]:+,.0f}")
print(f" R²: {model_all.score(X_all, price):.4f}")
print("\nAFTER (only sqft):")
print(f" Sqft: ${model_reduced.coef_[0]:+.2f}")
print(f" R²: {model_reduced.score(X_reduced, price):.4f}")
print("\n✓ Coefficient now makes sense!")
print("✓ R² barely changed (redundant features added no information)")
Output:
FIX 1: REMOVE REDUNDANT FEATURES
============================================================
BEFORE (all features):
Sqft: $+156.23
Rooms: $-12,456
Bathrooms: $-8,234
R²: 0.8234
AFTER (only sqft):
Sqft: $+99.87
R²: 0.8198
✓ Coefficient now makes sense!
✓ R² barely changed (redundant features added no information)
Fix 2: Combine Features (Feature Engineering)
# Instead of separate features, create ONE combined feature
# Option A: Average
size_combined = (sqft + rooms * 300 + bathrooms * 500) / 3
# Option B: First Principal Component
from sklearn.decomposition import PCA
pca = PCA(n_components=1)
size_pca = pca.fit_transform(np.column_stack([sqft, rooms, bathrooms]))
print("FIX 2: COMBINE INTO ONE FEATURE")
print("="*60)
# Fit with combined feature
model_combined = LinearRegression().fit(size_pca, price)
print(f"\nUsing PCA (first component captures {pca.explained_variance_ratio_[0]*100:.1f}% of variance)")
print(f" Coefficient: {model_combined.coef_[0]:.2f}")
print(f" R²: {model_combined.score(size_pca, price):.4f}")
print("\n✓ One feature captures most of the information")
print("✓ No multicollinearity possible with one feature!")
Fix 3: Ridge Regression (L2 Regularization)
Ridge regression adds a penalty that stabilizes coefficients even with multicollinearity.
from sklearn.linear_model import Ridge, LinearRegression
import numpy as np
# Compare OLS vs Ridge with multicollinear data
X_collinear = np.column_stack([sqft, rooms, bathrooms])
# OLS (unstable with multicollinearity)
ols = LinearRegression().fit(X_collinear, price)
# Ridge (stabilized)
ridge = Ridge(alpha=1.0).fit(X_collinear, price)
print("FIX 3: RIDGE REGRESSION")
print("="*60)
print(f"\n{'Feature':<15} {'OLS':>15} {'Ridge':>15}")
print("-"*45)
print(f"{'Sqft':<15} ${ols.coef_[0]:>14.2f} ${ridge.coef_[0]:>14.2f}")
print(f"{'Rooms':<15} ${ols.coef_[1]:>14,.0f} ${ridge.coef_[1]:>14,.0f}")
print(f"{'Bathrooms':<15} ${ols.coef_[2]:>14,.0f} ${ridge.coef_[2]:>14,.0f}")
print(f"\n✓ Ridge coefficients are more reasonable")
print(f"✓ No more negative coefficients for rooms/bathrooms")
print(f"✓ Coefficients are 'shrunk' toward each other")
Output:
FIX 3: RIDGE REGRESSION
============================================================
Feature OLS Ridge
---------------------------------------------
Sqft $ 156.23 $ 89.45
Rooms $ -12,456 $ 1,234
Bathrooms $ -8,234 $ 2,567
✓ Ridge coefficients are more reasonable
✓ No more negative coefficients for rooms/bathrooms
✓ Coefficients are 'shrunk' toward each other
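One practical caveat: Ridge penalizes coefficients on their raw scale, and square footage (in the thousands) lives on a very different scale than room counts (single digits), so in practice you'd usually standardize the features first. A minimal sketch of that pattern with a sklearn Pipeline (alpha=1.0 is just a placeholder, not a tuned value):
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
# Standardize first so the L2 penalty treats every feature comparably.
ridge_scaled = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
ridge_scaled.fit(X_collinear, price)
# Coefficients are now on the standardized scale (per standard deviation).
print("Standardized coefficients:", ridge_scaled.named_steps["ridge"].coef_)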
Fix 4: Lasso Regression (Automatic Feature Selection)
Lasso can automatically set some coefficients to ZERO, removing redundant features.
from sklearn.linear_model import Lasso
# Lasso with enough regularization
lasso = Lasso(alpha=1000).fit(X_collinear, price)
print("FIX 4: LASSO REGRESSION (Automatic Feature Selection)")
print("="*60)
print(f"\n{'Feature':<15} {'Coefficient':>15}")
print("-"*30)
print(f"{'Sqft':<15} ${lasso.coef_[0]:>14.2f}")
print(f"{'Rooms':<15} ${lasso.coef_[1]:>14.2f}")
print(f"{'Bathrooms':<15} ${lasso.coef_[2]:>14.2f}")
n_selected = np.sum(lasso.coef_ != 0)
print(f"\n✓ Lasso kept {n_selected} feature(s), set others to zero")
print(f"✓ Automatic redundant feature removal!")
Output:
FIX 4: LASSO REGRESSION (Automatic Feature Selection)
============================================================
Feature Coefficient
------------------------------
Sqft $ 98.23
Rooms $ 0.00
Bathrooms $ 0.00
✓ Lasso kept 1 feature(s), set others to zero
✓ Automatic redundant feature removal!
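The alpha=1000 above was hand-picked for this example; too small and nothing gets dropped, too large and everything does. In practice you'd usually let cross-validation choose the strength, for example with LassoCV. A sketch, continuing with X_collinear and price:
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
# Standardize, then let 5-fold cross-validation pick the regularization strength.
lasso_cv = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=42))
lasso_cv.fit(X_collinear, price)
fitted = lasso_cv.named_steps["lassocv"]
print(f"Chosen alpha: {fitted.alpha_:.2f}")
print(f"Non-zero coefficients: {np.sum(fitted.coef_ != 0)} of {len(fitted.coef_)}")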
Fix 5: Domain Knowledge — Choose Wisely
Sometimes the best fix is using your brain:
print("FIX 5: USE DOMAIN KNOWLEDGE")
print("="*60)
print("""
QUESTION: Square feet, rooms, and bathrooms are all correlated.
Which should I keep?
CONSIDERATIONS:
1. INTERPRETABILITY
- "Price per sqft" is a standard industry metric
- "Price per room" is less common but interpretable
- Square feet is probably the most useful
2. DATA QUALITY
- Which measurement is most accurate?
- Square feet might be from official records
- Room count might be self-reported (less reliable)
3. BUSINESS NEED
- What question are you answering?
- If "how much does space cost?" → use sqft
- If "how much does a bedroom add?" → use rooms
4. FEATURE AVAILABILITY
- What will you have at prediction time?
- If predicting for new construction: sqft is known early
- If predicting from listings: rooms might be easier to get
DECISION: Keep square feet, drop rooms and bathrooms.
""")
Complete Multicollinearity Diagnostic
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
def full_multicollinearity_check(X, feature_names, y=None):
"""
Complete multicollinearity diagnostic.
"""
print("="*70)
print("MULTICOLLINEARITY DIAGNOSTIC REPORT")
print("="*70)
df = pd.DataFrame(X, columns=feature_names)
# =========================================
# 1. Correlation Matrix
# =========================================
print("\n1. CORRELATION MATRIX")
print("-"*70)
corr_matrix = df.corr()
print(corr_matrix.round(3).to_string())
# Find problematic pairs
high_corr = []
for i in range(len(feature_names)):
for j in range(i+1, len(feature_names)):
r = corr_matrix.iloc[i, j]
if abs(r) > 0.7:
high_corr.append((feature_names[i], feature_names[j], r))
if high_corr:
print(f"\n⚠️ High correlations (|r| > 0.7):")
for f1, f2, r in high_corr:
print(f" {f1} ↔ {f2}: {r:.3f}")
# =========================================
# 2. Variance Inflation Factor
# =========================================
print("\n2. VARIANCE INFLATION FACTORS")
print("-"*70)
X_with_const = sm.add_constant(X)
print(f"{'Feature':<20} {'VIF':>10} {'Status':>15}")
print("-"*45)
severe_vif = []
for i, name in enumerate(feature_names):
vif = variance_inflation_factor(X_with_const, i + 1)
if vif > 10:
status = "SEVERE"
severe_vif.append(name)
elif vif > 5:
status = "HIGH"
else:
status = "OK"
print(f"{name:<20} {vif:>10.2f} {status:>15}")
# =========================================
# 3. Condition Number
# =========================================
print("\n3. CONDITION NUMBER")
print("-"*70)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
X_std_const = np.column_stack([np.ones(len(X)), X_std])
cond_num = np.linalg.cond(X_std_const)
print(f"Condition Number: {cond_num:.2f}")
if cond_num > 100:
print("⚠️ HIGH condition number indicates multicollinearity")
# =========================================
# 4. Coefficient Stability Check (if y provided)
# =========================================
if y is not None:
print("\n4. COEFFICIENT STABILITY (Bootstrap)")
print("-"*70)
coefs_list = {name: [] for name in feature_names}
for _ in range(100):
idx = np.random.choice(len(X), len(X), replace=True)
model = LinearRegression().fit(X[idx], y[idx])
for i, name in enumerate(feature_names):
coefs_list[name].append(model.coef_[i])
print(f"{'Feature':<20} {'Mean':>12} {'Std':>12} {'CV%':>10}")
print("-"*55)
for name in feature_names:
coefs = coefs_list[name]
mean_c = np.mean(coefs)
std_c = np.std(coefs)
cv = abs(std_c / mean_c) * 100 if mean_c != 0 else float('inf')
flag = "⚠️" if cv > 50 else ""
print(f"{name:<20} {mean_c:>12.2f} {std_c:>12.2f} {cv:>10.1f}% {flag}")
# =========================================
# 5. Recommendations
# =========================================
print("\n5. RECOMMENDATIONS")
print("-"*70)
if severe_vif:
print(f"⚠️ Severe multicollinearity detected in: {', '.join(severe_vif)}")
print("\nSuggested actions:")
print(" 1. Remove redundant features (keep most interpretable)")
print(" 2. Combine correlated features using PCA")
print(" 3. Use Ridge or Lasso regression")
print(" 4. If prediction is the goal, multicollinearity may be OK")
else:
print("✓ No severe multicollinearity detected")
return {
'correlation_matrix': corr_matrix,
'high_correlations': high_corr,
'condition_number': cond_num
}
# Run full diagnostic
results = full_multicollinearity_check(X, ['Sqft', 'Rooms', 'Bathrooms'], price)
Quick Reference
Detection Methods
| Method | How | Threshold | Best For |
|---|---|---|---|
| Correlation Matrix | Check pairwise correlations | \|r\| > 0.7 | Quick screening |
| VIF | Regress each feature on others | VIF > 10 | Gold standard |
| Condition Number | Matrix condition | > 100 | Overall health |
| Coefficient Stability | Bootstrap coefficients | High variance | Practical impact |
Fixes
| Fix | When to Use | Pros | Cons |
|---|---|---|---|
| Remove features | Clear redundancy | Simple, interpretable | Lose some information |
| PCA | Many correlated features | Captures all variance | Less interpretable |
| Ridge | Need all features | Stabilizes coefficients | Doesn't select features |
| Lasso | Want automatic selection | Selects features | May be too aggressive |
| Domain knowledge | Have expertise | Best interpretability | Requires expertise |
Key Takeaways
Multicollinearity = features telling the same story — Like three witnesses repeating one observation
High correlation ≠ always bad — Only a problem for interpretation and inference
VIF > 10 is severe — That feature can be 90%+ predicted from the other features
Predictions may be fine! — Multicollinearity breaks interpretation, not necessarily predictions
Coefficients become unstable — Small data changes → huge coefficient changes
Signs can flip — "More bedrooms decreases price" is a red flag
Ridge/Lasso help — Regularization stabilizes coefficients
Sometimes removing features is best — The simple solution often works
The One-Sentence Summary
Three witnesses telling the same story doesn't give you three times the evidence — when your features are highly correlated, you THINK you have more information than you do, your coefficients fight over who gets credit, and you end up with nonsense like "adding a bathroom DECREASES home value" even though the math technically minimizes error.
What's Next?
Now that you understand multicollinearity, you're ready for:
- Ridge Regression — L2 regularization to stabilize coefficients
- Lasso Regression — L1 regularization for feature selection
- Principal Component Regression — When you have too many correlated features
- Elastic Net — The best of Ridge and Lasso
Follow me for the next article in this series!
Let's Connect!
If "more bedrooms decreases price" finally makes sense (as a bug, not a feature), drop a heart!
Questions? Ask in the comments — I read and respond to every one.
What's the worst multicollinearity you've seen? I once saw a model with "height in inches" AND "height in centimeters" as separate features. VIF was literally infinite! 📏
The difference between "I added more features and R² went up!" and "I added more features and now nothing makes sense"? Multicollinearity. More features isn't always better — especially when they're all saying the same thing.
Share this with someone who throws every feature into their model hoping for the best. They're about to learn why that doesn't work.
Happy debugging! 🔍