The One-Line Summary: R-squared measures how much better your model is than just predicting the average every time. R² = 1 means perfect predictions. R² = 0 means you're no better than guessing the average. R² < 0 means you're somehow WORSE than guessing the average — your model is actively harmful.
The Darts Tournament of Prediction
Three contestants enter the "Predict the Target" darts championship.
The rules are simple: A target number appears (like 73, 45, 91, 28...). You throw a dart at a number line. Closest to the target wins.
Over 100 rounds, the targets ranged from 0 to 100, with an average of 50.
Player 1: "The Baseline" Barry
Barry's strategy: Always throw at 50 (the average).
Round 1: Target = 73 Barry throws: 50 Error: 23
Round 2: Target = 45 Barry throws: 50 Error: 5
Round 3: Target = 91 Barry throws: 50 Error: 41
Round 4: Target = 28 Barry throws: 50 Error: 22
...
Barry's total squared error: 25,000
Barry never tries to predict. He just throws at the center every time. Boring, but consistent.
Player 2: "The Predictor" Paula
Paula studies patterns. She notices the targets follow a pattern based on the time of day, previous numbers, and moon phase (okay, maybe not moon phase).
Round 1: Target = 73 Paula throws: 70 Error: 3
Round 2: Target = 45 Paula throws: 48 Error: 3
Round 3: Target = 91 Paula throws: 85 Error: 6
Round 4: Target = 28 Paula throws: 35 Error: 7
...
Paula's total squared error: 3,750
Paula's errors are MUCH smaller. Her "model" (pattern recognition) works!
Player 3: "The Overthinker" Oliver
Oliver built a complex system with 47 variables, three neural networks, and a ouija board. He's CERTAIN it's superior.
Round 1: Target = 73 Oliver throws: 12 Error: 61
Round 2: Target = 45 Oliver throws: 88 Error: 43
Round 3: Target = 91 Oliver throws: 15 Error: 76
Round 4: Target = 28 Oliver throws: 95 Error: 67
...
Oliver's total squared error: 42,000
Oliver's "sophisticated" system is a DISASTER. He's not just wrong — he's wronger than Barry who doesn't even try!
The Scorecard: R-Squared
R² = 1 - (Your Error / Baseline Error)
Barry (Baseline):
R² = 1 - (25,000 / 25,000) = 1 - 1 = 0
"Exactly as good as guessing the average"
Paula (The Predictor):
R² = 1 - (3,750 / 25,000) = 1 - 0.15 = 0.85
"85% better than guessing the average!"
Oliver (The Overthinker):
R² = 1 - (42,000 / 25,000) = 1 - 1.68 = -0.68
"68% WORSE than guessing the average!"
This is R-squared.
It answers one question: "Is your model better than just predicting the average?"
- R² = 1: Perfect predictions
- R² = 0.85: You explain 85% of the variance
- R² = 0: Your model equals the "just guess average" baseline
- R² = -0.68: Your model is WORSE than guessing the average
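Want to double-check the tournament math? Here's a tiny sketch that plugs the three players' (made-up) squared-error totals straight into the formula:
def r_squared(model_sse, baseline_sse):
    """R² = 1 - (your squared error / the baseline's squared error)."""
    return 1 - model_sse / baseline_sse

baseline_sse = 25_000  # Barry's total: always throw at the mean

print(f"Barry:  R² = {r_squared(25_000, baseline_sse):+.2f}")   # +0.00
print(f"Paula:  R² = {r_squared(3_750, baseline_sse):+.2f}")    # +0.85
print(f"Oliver: R² = {r_squared(42_000, baseline_sse):+.2f}")   # -0.68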
The Formal Definition
R² = 1 - (SS_res / SS_tot)
Where:
SS_res = Σ(actual - predicted)² ← Your model's squared errors
SS_tot = Σ(actual - mean)² ← Baseline's squared errors (variance)
In plain English:
R² = 1 - (How wrong YOU are / How wrong the AVERAGE would be)
Visual Intuition
BASELINE MODEL (Predict Mean = 50 for everything):
Value
100 │                         ●
    │                   ●
 75 │           ●                    ●
    │
 50 │═══════════════════════════════════  ← Baseline: Always predict 50
    │        ●
 25 │    ●
    │             ●
  0 └───────────────────────────────────
Errors = distances from all points to the line at 50
YOUR MODEL (Actually predicts):
Value
100 │                         ●
    │                      ╱
 75 │           ●       ╱           ●
    │                 ╱
 50 │              ╱
    │           ╱       ●
 25 │    ●   ╱
    │      ╱      ●
  0 └───────────────────────────────────
Errors = distances from points to your prediction line
R² asks: Are YOUR errors smaller than BASELINE errors?
Computing R-Squared in Python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# Generate some data
np.random.seed(42)
X = np.random.randn(100, 3)
y = 3*X[:, 0] + 2*X[:, 1] - X[:, 2] + np.random.randn(100) * 0.5
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# Fit model
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Calculate R²
r2 = r2_score(y_test, y_pred)
print(f"R² Score: {r2:.4f}")
# Manual calculation to understand
mean_y = np.mean(y_test)
ss_tot = np.sum((y_test - mean_y) ** 2) # Baseline error
ss_res = np.sum((y_test - y_pred) ** 2) # Model error
r2_manual = 1 - (ss_res / ss_tot)
print(f"R² (manual): {r2_manual:.4f}")
Output:
R² Score: 0.9512
R² (manual): 0.9512
Interpretation: The model explains 95.12% of the variance in y. Only 4.88% is unexplained (noise, missing features, etc.).
The R-Squared Scale
R² = 1.0 Perfect predictions (suspicious - probably overfitting!)
│
R² = 0.9 Excellent - explains 90% of variance
│
R² = 0.7 Good - explains 70% of variance
│
R² = 0.5 Moderate - explains 50% of variance
│
R² = 0.3 Weak - only explains 30% of variance
│
R² = 0.0 Useless - no better than predicting the mean
│
R² = -0.5 HARMFUL - 50% worse than predicting the mean!
│
R² → -∞ Catastrophically bad - your model is sabotage
What Does NEGATIVE R-Squared Mean?
This is where people get confused. How can you be WORSE than guessing the average?
It happens when your model's predictions are so bad that you'd be better off ignoring it entirely.
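In fact, scikit-learn ships the "ignore the model, just guess the average" strategy as an actual estimator: DummyRegressor. A quick sketch (reusing X_train, X_test, y_train, y_test from the example above) shows the baseline itself scoring right around zero:
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score

baseline = DummyRegressor(strategy="mean")  # always predicts the training mean
baseline.fit(X_train, y_train)

print(f"Baseline R²: {r2_score(y_test, baseline.predict(X_test)):.4f}")
# ≈ 0.0 (slightly negative if the test mean drifts from the train mean)
Any model you deploy should at least beat this do-nothing estimator.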
Scenario 1: Overfitting on Training Data
from sklearn.tree import DecisionTreeRegressor
# Overfit a tree (no max_depth = memorize training data)
overfit_model = DecisionTreeRegressor(max_depth=None)
overfit_model.fit(X_train, y_train)
# Training R² - looks amazing!
train_pred = overfit_model.predict(X_train)
train_r2 = r2_score(y_train, train_pred)
print(f"Training R²: {train_r2:.4f}") # 1.0 - perfect!
# Test R² - disaster!
test_pred = overfit_model.predict(X_test)
test_r2 = r2_score(y_test, test_pred)
print(f"Test R²: {test_r2:.4f}") # Could be negative!
Output:
Training R²: 1.0000
Test R²: 0.2341 (or worse, could be negative!)
The model memorized training data patterns that don't generalize.
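A hedged sketch of the fix: cap the tree's depth so it can't memorize every training point (max_depth=5 here is just an illustrative choice, not a tuned value), and watch the train/test gap shrink:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

tuned_model = DecisionTreeRegressor(max_depth=5, random_state=0)
tuned_model.fit(X_train, y_train)

print(f"Training R²: {r2_score(y_train, tuned_model.predict(X_train)):.4f}")
print(f"Test R²:     {r2_score(y_test, tuned_model.predict(X_test)):.4f}")
# Expect a lower (honest) training R² and a much smaller train/test gap.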
Scenario 2: Wrong Model for the Data
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
# Data with a clear NON-LINEAR pattern
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = np.sin(X).ravel() * 10 + np.random.randn(100) * 0.5
# Fit a LINEAR model to the first stretch of the curve...
# (Scored on its own training data, an OLS line with an intercept can never
#  go below R² = 0, so we hold out the last stretch to score it honestly.)
linear_model = LinearRegression()
linear_model.fit(X[:70], y[:70])
# ...and evaluate where the line keeps falling but the sine turns back up
y_pred = linear_model.predict(X[70:])
r2 = r2_score(y[70:], y_pred)
print(f"R² for linear model on sine wave: {r2:.4f}")
Output (approximate; the exact value wobbles a little with the random noise):
R² for linear model on sine wave: roughly -9
Strongly negative! On the held-out stretch, a straight line through a sine wave is far WORSE than just predicting that stretch's average.
y
 10 │    ╭──╮          ╭──╮
    │  ╭─╯  ╰─╮      ╭─╯  ╰─╮
  0 │──╯      ╰──────╯      ╰───   ← Sine wave data
    │   ─────______
-10 │              ─────______     ← Linear model (wrong!)
    └─────────────────────────────
The line misses the pattern entirely!
On the held-out stretch, predicting the mean would be closer to most points.
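The flip side, as a sketch with illustrative settings rather than tuned values: give the same sine-wave data a random train/test split (so the model interpolates instead of extrapolating) and a model flexible enough to bend with the curve, and R² swings back well above zero.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

# Random split of the same X, y from above (interpolation, not extrapolation)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeRegressor(max_depth=5, random_state=0)  # illustrative depth
tree.fit(X_tr, y_tr)

print(f"Tree R² on held-out sine data: {r2_score(y_te, tree.predict(X_te)):.4f}")
# Strongly positive: the model's shape finally matches the data's shape.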
Scenario 3: Predicting the Wrong Thing
# Predicting house prices, but the model learned to predict... something else?
# Buggy feature engineering, the wrong target column, a unit mix-up, etc.
y_true = [300000, 450000, 275000, 500000, 350000]
y_pred = [100, 200, 150, 250, 175] # Model predicting in wrong units!
r2 = r2_score(y_true, y_pred)
print(f"R²: {r2:.4f}")
Output:
R²: -18.7315
Hugely negative! The predictions aren't even in the same universe as the targets.
Why Negative R² Happens: The Math
R² = 1 - (SS_res / SS_tot)
For R² to be negative:
SS_res / SS_tot > 1
SS_res > SS_tot
Meaning:
Your model's squared errors > Baseline's squared errors
Your errors > "Just guess mean" errors
You're ADDING error compared to doing nothing!
Visual:
Baseline (guess the mean):   actual ●──→ mean                        (short error arrows)
Your model:                  actual ●──────────────→ your prediction (long error arrows)
If your arrows are LONGER than the baseline arrows, R² goes negative!
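To see SS_res blow straight past SS_tot, plug the Scenario 3 house-price numbers back into the two sums:
import numpy as np

y_true = np.array([300000, 450000, 275000, 500000, 350000])
y_pred = np.array([100, 200, 150, 250, 175])

ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # baseline ("guess the mean") errors
ss_res = np.sum((y_true - y_pred) ** 2)         # the model's errors

print(f"SS_tot: {ss_tot:,.0f}")                 # 37,500,000,000
print(f"SS_res: {ss_res:,.0f}")                 # 739,930,165,625 (about 20x bigger)
print(f"R²:     {1 - ss_res / ss_tot:.4f}")     # -18.7315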
R² vs Other Regression Metrics
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
y_true = np.array([100, 150, 200, 250, 300])
# Good model
y_good = np.array([105, 145, 210, 245, 295])
# Bad model (worse than mean)
y_bad = np.array([300, 100, 300, 100, 300])
# Mean prediction (baseline)
y_mean = np.array([200, 200, 200, 200, 200])
print("Good Model:")
print(f" R²: {r2_score(y_true, y_good):.4f}")
print(f" MSE: {mean_squared_error(y_true, y_good):.2f}")
print(f" MAE: {mean_absolute_error(y_true, y_good):.2f}")
print("\nBad Model:")
print(f" R²: {r2_score(y_true, y_bad):.4f}")
print(f" MSE: {mean_squared_error(y_true, y_bad):.2f}")
print(f" MAE: {mean_absolute_error(y_true, y_bad):.2f}")
print("\nBaseline (Always Predict Mean):")
print(f" R²: {r2_score(y_true, y_mean):.4f}")
print(f" MSE: {mean_squared_error(y_true, y_mean):.2f}")
print(f" MAE: {mean_absolute_error(y_true, y_mean):.2f}")
Output:
Good Model:
R²: 0.9920
MSE: 40.00
MAE: 6.00
Bad Model:
R²: -2.0000
MSE: 15000.00
MAE: 100.00
Baseline (Always Predict Mean):
R²: 0.0000
MSE: 5000.00
MAE: 60.00
Key insights:
| Metric | What it tells you |
|---|---|
| R² | How good vs. baseline (relative) |
| MSE | Average squared error magnitude |
| MAE | Average absolute error magnitude |
R² is relative — it compares to baseline.
MSE/MAE are absolute — raw error magnitudes.
Adjusted R-Squared: The Honest Version
Problem: on the training data, R² always increases (or at best stays the same) when you add more features, even useless ones!
from sklearn.metrics import r2_score
import numpy as np
# True relationship: y = x1 + noise
np.random.seed(42)
n = 100
X1 = np.random.randn(n)
y = X1 + np.random.randn(n) * 0.5
# Add useless random features
X_noise = np.random.randn(n, 10)
X_all = np.column_stack([X1, X_noise])
# R² keeps increasing with more (useless) features!
from sklearn.linear_model import LinearRegression
print("Features | R² (train)")
print("-" * 25)
for n_features in [1, 3, 5, 7, 11]:
    X_subset = X_all[:, :n_features]
    model = LinearRegression().fit(X_subset, y)
    r2 = r2_score(y, model.predict(X_subset))
    print(f"   {n_features}      | {r2:.4f}")
Output:
Features | R² (train)
-------------------------
1 | 0.7823
3 | 0.7891
5 | 0.7954
7 | 0.8012
11 | 0.8134
R² went UP even though we added garbage features!
Solution: Adjusted R²
Adjusted R² penalizes adding features:
Adjusted R² = 1 - (1 - R²) × (n - 1) / (n - p - 1)
Where:
n = number of samples
p = number of features
def adjusted_r2(r2, n, p):
    """Calculate adjusted R-squared."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)
n = 100
print("\nFeatures | R² (train) | Adjusted R²")
print("-" * 40)
for n_features in [1, 3, 5, 7, 11]:
    X_subset = X_all[:, :n_features]
    model = LinearRegression().fit(X_subset, y)
    r2 = r2_score(y, model.predict(X_subset))
    adj_r2 = adjusted_r2(r2, n, n_features)
    print(f"   {n_features}      | {r2:.4f}   | {adj_r2:.4f}")
Output:
Features | R² (train) | Adjusted R²
----------------------------------------
1 | 0.7823 | 0.7801
3 | 0.7891 | 0.7825
5 | 0.7954 | 0.7845
7 | 0.8012 | 0.7860
11 | 0.8134 | 0.7900
Adjusted R² barely moves, and can even decrease, as the garbage features pile up, while plain R² keeps climbing. It's the more honest number!
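If you'd rather not hand-roll the formula, statsmodels reports adjusted R² out of the box. A sketch assuming the same X_all and y as above (note that statsmodels wants an explicit intercept column):
import statsmodels.api as sm

X_with_const = sm.add_constant(X_all[:, :5])  # intercept + the first 5 features
ols_result = sm.OLS(y, X_with_const).fit()

print(f"R²:          {ols_result.rsquared:.4f}")
print(f"Adjusted R²: {ols_result.rsquared_adj:.4f}")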
When R² Lies
Lie 1: High R² Doesn't Mean Good Predictions
# Near-perfect R², but the errors can still be too big for the job
y_true = [1000000, 1000001, 1000002, 1000003, 1000004]
y_pred = [1000000.1, 1000001.2, 1000001.9, 1000003.1, 1000003.9]
r2 = r2_score(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
print(f"R²: {r2:.4f}") # Very high!
print(f"MAE: {mae:.2f}") # Errors might still be unacceptable
Output:
R²: 0.9920
MAE: 0.12
High R², but if you're predicting dollars and need to be within $0.01, this model fails!
Lie 2: R² Depends on Data Variance
# Same model performance, different R² due to data variance
# Low variance data
y_true_low = [48, 49, 50, 51, 52]
y_pred_low = [47, 50, 51, 50, 53]
# High variance data
y_true_high = [10, 30, 50, 70, 90]
y_pred_high = [9, 31, 51, 69, 91]
# Same absolute errors!
print(f"Low variance R²: {r2_score(y_true_low, y_pred_low):.4f}")
print(f"High variance R²: {r2_score(y_true_high, y_pred_high):.4f}")
Output:
Low variance R²: 0.5000
High variance R²: 0.9988
Same prediction quality, wildly different R²! The high-variance data has more "easy variance" to explain.
Lie 3: R² Barely Notices Bias
# Model that's systematically off by a constant
y_true = [100, 200, 300, 400, 500]
y_biased = [150, 250, 350, 450, 550] # Always +50 too high!
r2 = r2_score(y_true, y_biased)
print(f"R²: {r2:.4f}") # Still looks great!
print(f"All predictions are $50 too high!")
Output:
R²: 0.8750
All predictions are $50 too high!
A high R² (0.875) even though every single prediction is $50 too high. The predictions move in perfect lockstep with the targets (their correlation is literally 1.0), so R² only dips a little for the offset. R² rewards tracking the ups and downs; it's a poor tool for catching a systematic shift.
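Two quick checks that expose what R² glosses over here: the squared correlation says the predictions track the targets perfectly, and the mean residual immediately reveals the $50 offset.
import numpy as np

y_true = np.array([100, 200, 300, 400, 500])
y_biased = np.array([150, 250, 350, 450, 550])

corr_sq = np.corrcoef(y_true, y_biased)[0, 1] ** 2
mean_residual = np.mean(y_true - y_biased)

print(f"Squared correlation: {corr_sq:.4f}")        # 1.0000: the bias is invisible here
print(f"Mean residual:       {mean_residual:.2f}")  # -50.00: there's the bias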
When to Use R²
✅ Use R² When:
1. Comparing models on the same dataset
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor

models = [LinearRegression(), Ridge(), Lasso(), RandomForestRegressor()]
for model in models:
    model.fit(X_train, y_train)
    r2 = r2_score(y_test, model.predict(X_test))
    print(f"{model.__class__.__name__}: R² = {r2:.4f}")
2. Quick sanity check — "Is my model doing anything?"
if r2 < 0:
    print("🚨 Model is WORSE than baseline! Something is wrong.")
elif r2 < 0.3:
    print("⚠️ Model is weak. Need better features or different approach.")
else:
    print("✓ Model has some predictive power.")
3. Communicating explained variance to stakeholders
"Our model explains 78% of the variation in house prices."
❌ Don't Use R² When:
1. Comparing across different datasets
# ❌ WRONG
"Model A on Dataset X: R² = 0.90"
"Model B on Dataset Y: R² = 0.75"
"Model A is better!"
# ✅ RIGHT
# Datasets have different inherent variance!
# Only compare models on the SAME data
2. When absolute error magnitude matters
# Use MSE, MAE, or RMSE instead
from sklearn.metrics import mean_absolute_error, mean_squared_error
mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(f"On average, predictions are off by ${mae:.2f}")
3. When you need to detect bias
# Check for systematic over/under prediction
residuals = y_true - y_pred
mean_residual = np.mean(residuals)
print(f"Average residual: {mean_residual:.2f}") # Should be ~0
Complete Example: Diagnosing Model Problems with R²
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
# Generate data with known pattern
np.random.seed(42)
n = 500
X = np.random.randn(n, 5)
y = 3*X[:, 0] + 2*X[:, 1]**2 - X[:, 2] + np.random.randn(n) * 2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# Test different models
models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree (overfit)': DecisionTreeRegressor(max_depth=None),
    'Decision Tree (tuned)': DecisionTreeRegressor(max_depth=5),
    'Random Forest': RandomForestRegressor(n_estimators=100, max_depth=10)
}
print("Model Diagnosis")
print("=" * 65)
print(f"{'Model':<25} {'Train R²':>12} {'Test R²':>12} {'Status':>15}")
print("-" * 65)
for name, model in models.items():
    model.fit(X_train, y_train)
    train_r2 = r2_score(y_train, model.predict(X_train))
    test_r2 = r2_score(y_test, model.predict(X_test))
    # Diagnose
    if train_r2 > 0.95 and test_r2 < train_r2 - 0.2:
        status = "⚠️ Overfit!"
    elif test_r2 < 0:
        status = "🚨 Harmful!"
    elif test_r2 < 0.3:
        status = "😐 Weak"
    elif test_r2 > 0.7:
        status = "✓ Good"
    else:
        status = "👍 Okay"
    print(f"{name:<25} {train_r2:>12.4f} {test_r2:>12.4f} {status:>15}")
Output:
Model Diagnosis
=================================================================
Model Train R² Test R² Status
-----------------------------------------------------------------
Linear Regression 0.6234 0.5891 👍 Okay
Decision Tree (overfit) 1.0000 0.4521 ⚠️ Overfit!
Decision Tree (tuned) 0.7823 0.7234 ✓ Good
Random Forest 0.9456 0.8123 ✓ Good
Insights:
- Linear Regression can't capture the squared X[:, 1] term (nonlinearity)
- Deep Decision Tree memorizes training data (R²=1) but fails on test
- Tuned tree and Random Forest generalize well
Quick Reference
The Formula
R² = 1 - (SS_res / SS_tot)
= 1 - (Σ(y - ŷ)² / Σ(y - ȳ)²)
= 1 - (Your errors / Baseline errors)
Interpretation Scale
| R² Value | Meaning | Action |
|---|---|---|
| 1.0 | Perfect | Suspicious! Check for leakage |
| 0.9+ | Excellent | Great model |
| 0.7-0.9 | Good | Solid performance |
| 0.5-0.7 | Moderate | Room for improvement |
| 0.3-0.5 | Weak | Need better features |
| 0-0.3 | Poor | Rethink approach |
| < 0 | Harmful | Model is worse than baseline! |
Negative R² Causes
| Cause | Fix |
|---|---|
| Overfitting | Reduce model complexity, regularize |
| Wrong model type | Try different algorithms |
| Bad features | Feature engineering |
| Data issues | Check for errors, outliers |
| Test distribution shift | Ensure train/test similarity |
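For the last row of that table, one cheap sanity check is to compare the target's distribution in train vs. test (a sketch reusing y_train and y_test from the examples above):
import numpy as np

print(f"Train target: mean = {np.mean(y_train):.2f}, std = {np.std(y_train):.2f}")
print(f"Test target:  mean = {np.mean(y_test):.2f}, std = {np.std(y_test):.2f}")
# If these differ wildly, the test set lives in a different world than the
# training set, and even a sensible model can post a negative R² there.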
Key Takeaways
R² = 1 - (Your Error / Baseline Error) — Measures improvement over predicting the mean
R² = 0 means no better than guessing average — Model has no predictive power
R² < 0 means WORSE than guessing average — Model is actively harmful
Negative R² usually means overfitting or wrong model — Something is fundamentally broken
High R² doesn't mean good predictions — Could have high variance data or systematic bias
Use Adjusted R² when comparing models with different feature counts — Penalizes unnecessary complexity
Don't compare R² across datasets — Variance differences make it meaningless
Always check train AND test R² — Big gap = overfitting
The One-Sentence Summary
R-squared asks "Is your dart-throwing strategy better than just aiming at the center every time?" — if R² is positive you're doing something right, if it's zero you might as well guess the average, and if it's negative your "strategy" is somehow making things WORSE than not trying at all.
What's Next?
Now that you understand R-squared, you're ready for:
- MSE, RMSE, MAE — Absolute error metrics for regression
- MAPE — Percentage-based error metrics
- Residual Analysis — Diagnosing model problems visually
- Cross-Validation for Regression — Robust model evaluation
Follow me for the next article in this series!
Let's Connect!
If negative R² finally makes sense now, drop a heart!
Questions? Ask in the comments — I read and respond to every one.
What's the worst (most negative) R² you've ever seen? I once hit -47 due to a unit conversion bug. The model was predicting in cents while targets were in dollars! 😅
The difference between a model that explains 85% of housing price variance and one that's somehow WORSE than just guessing "$400,000" for every house? R-squared. When it goes negative, your sophisticated model is being outperformed by someone who doesn't even know what machine learning is.
Share this with someone who's never seen negative R². When they do, they'll know exactly what went wrong.
Happy regressing! 🎯