Sachin Kr. Rajput

R-Squared Explained: The Dart Player Who's Somehow WORSE Than Just Aiming at the Center

The One-Line Summary: R-squared measures how much better your model is than just predicting the average every time. R² = 1 means perfect predictions. R² = 0 means you're no better than guessing the average. R² < 0 means you're somehow WORSE than guessing the average — your model is actively harmful.


The Darts Tournament of Prediction

Three contestants enter the "Predict the Target" darts championship.

The rules are simple: A target number appears (like 73, 45, 91, 28...). You throw a dart at a number line. Closest to the target wins.

Over 100 rounds, the targets ranged from 0 to 100, with an average of 50.


Player 1: "The Baseline" Barry

Barry's strategy: Always throw at 50 (the average).

Round 1: Target = 73    Barry throws: 50    Error: 23
Round 2: Target = 45    Barry throws: 50    Error: 5
Round 3: Target = 91    Barry throws: 50    Error: 41
Round 4: Target = 28    Barry throws: 50    Error: 22
...

Barry's total squared error: 25,000

Barry never tries to predict. He just throws at the center every time. Boring, but consistent.


Player 2: "The Predictor" Paula

Paula studies patterns. She notices the targets follow a pattern based on the time of day, previous numbers, and moon phase (okay, maybe not moon phase).

Round 1: Target = 73    Paula throws: 70    Error: 3
Round 2: Target = 45    Paula throws: 48    Error: 3
Round 3: Target = 91    Paula throws: 85    Error: 6
Round 4: Target = 28    Paula throws: 35    Error: 7
...

Paula's total squared error: 3,750

Paula's errors are MUCH smaller. Her "model" (pattern recognition) works!


Player 3: "The Overthinker" Oliver

Oliver built a complex system with 47 variables, three neural networks, and a ouija board. He's CERTAIN it's superior.

Round 1: Target = 73    Oliver throws: 12    Error: 61
Round 2: Target = 45    Oliver throws: 88    Error: 43
Round 3: Target = 91    Oliver throws: 15    Error: 76
Round 4: Target = 28    Oliver throws: 95    Error: 67
...

Oliver's total squared error: 42,000

Oliver's "sophisticated" system is a DISASTER. He's not just wrong — he's wronger than Barry who doesn't even try!


The Scorecard: R-Squared

R² = 1 - (Your Error / Baseline Error)

Barry (Baseline):
R² = 1 - (25,000 / 25,000) = 1 - 1 = 0
"Exactly as good as guessing the average"

Paula (The Predictor):
R² = 1 - (3,750 / 25,000) = 1 - 0.15 = 0.85
"85% better than guessing the average!"

Oliver (The Overthinker):
R² = 1 - (42,000 / 25,000) = 1 - 1.68 = -0.68
"68% WORSE than guessing the average!"

This is R-squared.

It answers one question: "Is your model better than just predicting the average?"

  • R² = 1: Perfect predictions
  • R² = 0.85: You explain 85% of the variance
  • R² = 0: Your model equals the "just guess average" baseline
  • R² = -0.68: Your model is WORSE than guessing the average
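To see the scorecard as code, here's a minimal sketch that turns the tournament's squared-error totals into R² values (plain Python, nothing else needed):

baseline_error = 25_000  # Barry's total squared error (always throw at the mean)

players = {
    "Barry (Baseline)": 25_000,
    "Paula (Predictor)": 3_750,
    "Oliver (Overthinker)": 42_000,
}

for name, total_squared_error in players.items():
    # R² = 1 - (your squared error / baseline squared error)
    r2 = 1 - total_squared_error / baseline_error
    print(f"{name}: R² = {r2:.2f}")

# Barry (Baseline): R² = 0.00
# Paula (Predictor): R² = 0.85
# Oliver (Overthinker): R² = -0.68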

The Formal Definition

R² = 1 - (SS_res / SS_tot)

Where:
  SS_res = Σ(actual - predicted)²    ← Your model's squared errors
  SS_tot = Σ(actual - mean)²         ← Baseline's squared errors (variance)

In plain English:

R² = 1 - (How wrong YOU are / How wrong the AVERAGE would be)

Visual Intuition

BASELINE MODEL (Predict Mean = 50 for everything):

Value
100 │              ●
    │         ●
 75 │    ●              ●
    │
 50 │════════════════════════  ← Baseline: Always predict 50
    │              ●
 25 │       ●
    │                    ●
  0 └─────────────────────────

    Errors = distances from all points to line at 50


YOUR MODEL (Actually predicts):

Value
100 │              ●
    │         ●  ╱
 75 │    ●    ╱        ●
    │      ╱      ╱
 50 │    ╱    ╱
    │  ╱   ╱       ●
 25 │ ╱  ●
    │╱               ●
  0 └─────────────────────────

    Errors = distances from points to your prediction line


R² asks: Are YOUR errors smaller than BASELINE errors?

Computing R-Squared in Python

import numpy as np
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Generate some data
np.random.seed(42)
X = np.random.randn(100, 3)
y = 3*X[:, 0] + 2*X[:, 1] - X[:, 2] + np.random.randn(100) * 0.5

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Fit model
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Calculate R²
r2 = r2_score(y_test, y_pred)
print(f"R² Score: {r2:.4f}")

# Manual calculation to understand
mean_y = np.mean(y_test)
ss_tot = np.sum((y_test - mean_y) ** 2)      # Baseline error
ss_res = np.sum((y_test - y_pred) ** 2)      # Model error
r2_manual = 1 - (ss_res / ss_tot)
print(f"R² (manual): {r2_manual:.4f}")

Output:

R² Score: 0.9512
R² (manual): 0.9512

Interpretation: The model explains 95.12% of the variance in y. Only 4.88% is unexplained (noise, missing features, etc.).


The R-Squared Scale

R² = 1.0    Perfect predictions (suspicious - probably overfitting!)
    │
R² = 0.9    Excellent - explains 90% of variance
    │
R² = 0.7    Good - explains 70% of variance  
    │
R² = 0.5    Moderate - explains 50% of variance
    │
R² = 0.3    Weak - only explains 30% of variance
    │
R² = 0.0    Useless - no better than predicting the mean
    │
R² = -0.5   HARMFUL - 50% worse than predicting the mean!
    │
R² → -∞     Catastrophically bad - your model is sabotage
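If you want that scale as a quick helper, here's a small sketch; the cutoffs are the rough, domain-dependent labels from the scale above, not an official standard:

def describe_r2(r2: float) -> str:
    """Map an R² value to the rough labels from the scale above."""
    if r2 < 0:
        return "Harmful - worse than predicting the mean"
    if r2 < 0.3:
        return "Poor - barely better than predicting the mean"
    if r2 < 0.5:
        return "Weak"
    if r2 < 0.7:
        return "Moderate"
    if r2 < 0.9:
        return "Good"
    if r2 < 1.0:
        return "Excellent"
    return "Perfect - suspicious, check for leakage or overfitting"

print(describe_r2(0.85))   # Good
print(describe_r2(-0.68))  # Harmful - worse than predicting the mean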

What Does NEGATIVE R-Squared Mean?

This is where people get confused. How can you be WORSE than guessing the average?

It happens when your model's predictions are so bad that you'd be better off ignoring it entirely.

Scenario 1: Overfitting on Training Data

from sklearn.tree import DecisionTreeRegressor

# Overfit a tree (no max_depth = memorize training data)
overfit_model = DecisionTreeRegressor(max_depth=None)
overfit_model.fit(X_train, y_train)

# Training R² - looks amazing!
train_pred = overfit_model.predict(X_train)
train_r2 = r2_score(y_train, train_pred)
print(f"Training R²: {train_r2:.4f}")  # 1.0 - perfect!

# Test R² - disaster!
test_pred = overfit_model.predict(X_test)
test_r2 = r2_score(y_test, test_pred)
print(f"Test R²: {test_r2:.4f}")  # Could be negative!

Output:

Training R²: 1.0000
Test R²: 0.2341  (or worse, could be negative!)

The model memorized training data patterns that don't generalize.


Scenario 2: Wrong Model for the Data

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Data with a clear NON-LINEAR pattern
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = np.sin(X).ravel() * 10 + np.random.randn(100) * 0.5

# Fit a LINEAR model to NON-LINEAR data.
# (fit_intercept=False so the misfit shows up in-sample: a least-squares line
# WITH an intercept can never score below zero on its own training data,
# because it contains "predict the mean" as a special case.)
linear_model = LinearRegression(fit_intercept=False)
linear_model.fit(X, y)
y_pred = linear_model.predict(X)

r2 = r2_score(y, y_pred)
print(f"R² for linear model on sine wave: {r2:.4f}")

Output:

R² for linear model on sine wave: -0.0234

Negative! A straight line through a sine wave misses the pattern so badly that it's WORSE than just guessing the average. (One subtlety: an ordinary least-squares line with an intercept can never score below zero on its own training data, because it can always fall back to predicting the mean; that's why this example drops the intercept. On held-out data, even an intercept-fitted line can go negative.)

  y
 10│    ╭──╮        ╭──╮
   │  ╭─╯  ╰─╮    ╭─╯  ╰─╮
  0│──╯      ╰────╯      ╰───  ← Sine wave data
   │         _______________
   │        ╱               ╲  ← Linear model (wrong!)
-10│      ╱                   ╲
   └────────────────────────────

The line misses the pattern entirely!
Mean prediction would be closer to most points.

Scenario 3: Predicting the Wrong Thing

# Predicting house price, but model learned to predict... something else?
# Buggy feature engineering, a wrong target variable, mismatched units, etc.

y_true = [300000, 450000, 275000, 500000, 350000]
y_pred = [100, 200, 150, 250, 175]  # Model predicting in wrong units!

r2 = r2_score(y_true, y_pred)
print(f"R²: {r2:.4f}")

Output:

R²: -18.7315

Massively negative! The predictions aren't even in the same universe as the targets.


Why Negative R² Happens: The Math

R² = 1 - (SS_res / SS_tot)

For R² to be negative:
  SS_res / SS_tot > 1
  SS_res > SS_tot

Meaning:
  Your model's squared errors > Baseline's squared errors
  Your errors > "Just guess mean" errors

You're ADDING error compared to doing nothing!

Visual:

Baseline errors (guess mean):         Your model errors:

    ●                                     ●
    │←────→│                              │←──────────────────→│
    ●      │                              ●                    │
    │←──→│ │                              │←────────→│         │
         Mean                                   Your bad prediction

If your arrows are LONGER than baseline arrows, R² goes negative!
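Here's a minimal three-point check of that inequality (the numbers are made up purely for illustration): the moment SS_res exceeds SS_tot, R² drops below zero.

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([10.0, 20.0, 30.0])
y_pred = np.array([30.0, 0.0, 50.0])   # every prediction is 20 off

ss_res = np.sum((y_true - y_pred) ** 2)         # your model's squared errors: 1200
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # baseline's squared errors: 200

print(ss_res > ss_tot)             # True -> R² must be negative
print(1 - ss_res / ss_tot)         # -5.0
print(r2_score(y_true, y_pred))    # -5.0, sklearn agrees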

R² vs Other Regression Metrics

import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

y_true = np.array([100, 150, 200, 250, 300])

# Good model
y_good = np.array([105, 145, 210, 245, 295])

# Bad model (worse than mean)
y_bad = np.array([300, 100, 300, 100, 300])

# Mean prediction (baseline)
y_mean = np.array([200, 200, 200, 200, 200])

print("Good Model:")
print(f"  R²:   {r2_score(y_true, y_good):.4f}")
print(f"  MSE:  {mean_squared_error(y_true, y_good):.2f}")
print(f"  MAE:  {mean_absolute_error(y_true, y_good):.2f}")

print("\nBad Model:")
print(f"  R²:   {r2_score(y_true, y_bad):.4f}")
print(f"  MSE:  {mean_squared_error(y_true, y_bad):.2f}")
print(f"  MAE:  {mean_absolute_error(y_true, y_bad):.2f}")

print("\nBaseline (Always Predict Mean):")
print(f"  R²:   {r2_score(y_true, y_mean):.4f}")
print(f"  MSE:  {mean_squared_error(y_true, y_mean):.2f}")
print(f"  MAE:  {mean_absolute_error(y_true, y_mean):.2f}")

Output:

Good Model:
  R²:   0.9920
  MSE:  40.00
  MAE:  6.00

Bad Model:
  R²:   -2.0000
  MSE:  15000.00
  MAE:  100.00

Baseline (Always Predict Mean):
  R²:   0.0000
  MSE:  5000.00
  MAE:  60.00

Key insights:

Metric | What it tells you
-------|----------------------------------
R²     | How good vs. baseline (relative)
MSE    | Average squared error magnitude
MAE    | Average absolute error magnitude

R² is relative — it compares to baseline.
MSE/MAE are absolute — raw error magnitudes.


Adjusted R-Squared: The Honest Version

Problem: training R² never decreases when you add more features to a linear model, even useless ones!

from sklearn.metrics import r2_score
import numpy as np

# True relationship: y = x1 + noise
np.random.seed(42)
n = 100
X1 = np.random.randn(n)
y = X1 + np.random.randn(n) * 0.5

# Add useless random features
X_noise = np.random.randn(n, 10)
X_all = np.column_stack([X1, X_noise])

# R² keeps increasing with more (useless) features!
from sklearn.linear_model import LinearRegression

print("Features | R² (train)")
print("-" * 25)
for n_features in [1, 3, 5, 7, 11]:
    X_subset = X_all[:, :n_features]
    model = LinearRegression().fit(X_subset, y)
    r2 = r2_score(y, model.predict(X_subset))
    print(f"    {n_features}    |   {r2:.4f}")

Output:

Features | R² (train)
-------------------------
    1    |   0.7823
    3    |   0.7891
    5    |   0.7954
    7    |   0.8012
    11   |   0.8134

R² went UP even though we added garbage features!


Solution: Adjusted R²

Adjusted R² penalizes adding features:

Adjusted R² = 1 - (1 - R²) × (n - 1) / (n - p - 1)

Where:
  n = number of samples
  p = number of features
def adjusted_r2(r2, n, p):
    """Calculate adjusted R-squared."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

n = 100
print("\nFeatures | R² (train) | Adjusted R²")
print("-" * 40)
for n_features in [1, 3, 5, 7, 11]:
    X_subset = X_all[:, :n_features]
    model = LinearRegression().fit(X_subset, y)
    r2 = r2_score(y, model.predict(X_subset))
    adj_r2 = adjusted_r2(r2, n, n_features)
    print(f"    {n_features}     |   {r2:.4f}    |    {adj_r2:.4f}")

Output:

Features | R² (train) | Adjusted R²
----------------------------------------
    1     |   0.7823    |    0.7801
    3     |   0.7891    |    0.7825
    5     |   0.7954    |    0.7845
    7     |   0.8012    |    0.7860
    11    |   0.8134    |    0.7900

Adjusted R² barely moved despite the extra features (and it can drop outright when a new feature adds nothing). It's the more honest number when models have different feature counts.


When R² Lies

Lie 1: High R² Doesn't Mean Good Predictions

# Near-perfect R², but the predictions can still be useless in practice
y_true = [1000000, 1000001, 1000002, 1000003, 1000004]
y_pred = [1000000.1, 1000001.2, 1000001.9, 1000003.1, 1000003.9]

r2 = r2_score(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)

print(f"R²: {r2:.4f}")  # Very high!
print(f"MAE: {mae:.2f}")  # Errors might still be unacceptable

Output:

R²: 0.9920
MAE: 0.12

High R², but if you're predicting dollars and need to be within $0.01, this model fails!


Lie 2: R² Depends on Data Variance

# Same model performance, different R² due to data variance

# Low variance data
y_true_low = [48, 49, 50, 51, 52]
y_pred_low = [47, 50, 51, 50, 53]

# High variance data  
y_true_high = [10, 30, 50, 70, 90]
y_pred_high = [9, 31, 51, 69, 91]

# Same absolute errors!
print(f"Low variance R²:  {r2_score(y_true_low, y_pred_low):.4f}")
print(f"High variance R²: {r2_score(y_true_high, y_pred_high):.4f}")

Output:

Low variance R²:  0.5000
High variance R²: 0.9988

Same prediction quality, wildly different R²! The high-variance data has more "easy variance" to explain.


Lie 3: R² Doesn't Clearly Flag Bias

# Model that's systematically off by a constant

y_true = [100, 200, 300, 400, 500]
y_biased = [150, 250, 350, 450, 550]  # Always +50 too high!

r2 = r2_score(y_true, y_biased)
print(f"R²: {r2:.4f}")  # Still looks great!
print(f"All predictions are $50 too high!")

Output:

R²: 0.8750
All predictions are $50 too high!

An R² of 0.875 still looks strong even though every prediction is $50 too high! R² rewards matching the pattern in the data; a constant offset only dents the score, so check the residuals or MAE whenever systematic bias matters.


When to Use R²

✅ Use R² When:

1. Comparing models on the same dataset

from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor

models = [LinearRegression(), Ridge(), Lasso(), RandomForestRegressor()]

for model in models:
    model.fit(X_train, y_train)
    r2 = r2_score(y_test, model.predict(X_test))
    print(f"{model.__class__.__name__}: R² = {r2:.4f}")

2. Quick sanity check — "Is my model doing anything?"

if r2 < 0:
    print("🚨 Model is WORSE than baseline! Something is wrong.")
elif r2 < 0.3:
    print("⚠️ Model is weak. Need better features or different approach.")
else:
    print("✓ Model has some predictive power.")

3. Communicating explained variance to stakeholders

"Our model explains 78% of the variation in house prices."


❌ Don't Use R² When:

1. Comparing across different datasets

# ❌ WRONG
"Model A on Dataset X: R² = 0.90"
"Model B on Dataset Y: R² = 0.75"
"Model A is better!"

# ✅ RIGHT
# Datasets have different inherent variance!
# Only compare models on the SAME data

2. When absolute error magnitude matters

# Use MSE, MAE, or RMSE instead
from sklearn.metrics import mean_absolute_error, mean_squared_error

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(f"On average, predictions are off by ${mae:.2f}")

3. When you need to detect bias

# Check for systematic over/under prediction
residuals = np.array(y_true) - np.array(y_pred)
mean_residual = np.mean(residuals)
print(f"Average residual: {mean_residual:.2f}")  # Should be ~0

Complete Example: Diagnosing Model Problems with R²

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Generate data with known pattern
np.random.seed(42)
n = 500
X = np.random.randn(n, 5)
y = 3*X[:, 0] + 2*X[:, 1]**2 - X[:, 2] + np.random.randn(n) * 2

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Test different models
models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree (overfit)': DecisionTreeRegressor(max_depth=None),
    'Decision Tree (tuned)': DecisionTreeRegressor(max_depth=5),
    'Random Forest': RandomForestRegressor(n_estimators=100, max_depth=10)
}

print("Model Diagnosis")
print("=" * 65)
print(f"{'Model':<25} {'Train R²':>12} {'Test R²':>12} {'Status':>15}")
print("-" * 65)

for name, model in models.items():
    model.fit(X_train, y_train)

    train_r2 = r2_score(y_train, model.predict(X_train))
    test_r2 = r2_score(y_test, model.predict(X_test))

    # Diagnose
    if train_r2 > 0.95 and test_r2 < train_r2 - 0.2:
        status = "⚠️ Overfit!"
    elif test_r2 < 0:
        status = "🚨 Harmful!"
    elif test_r2 < 0.3:
        status = "😐 Weak"
    elif test_r2 > 0.7:
        status = "✓ Good"
    else:
        status = "👍 Okay"

    print(f"{name:<25} {train_r2:>12.4f} {test_r2:>12.4f} {status:>15}")

Output:

Model Diagnosis
=================================================================
Model                         Train R²      Test R²          Status
-----------------------------------------------------------------
Linear Regression               0.6234       0.5891           👍 Okay
Decision Tree (overfit)         1.0000       0.4521        ⚠️ Overfit!
Decision Tree (tuned)           0.7823       0.7234           ✓ Good
Random Forest                   0.9456       0.8123           ✓ Good

Insights:

  • Linear Regression can't capture the squared X[:, 1] term (nonlinearity)
  • Deep Decision Tree memorizes training data (R²=1) but fails on test
  • Tuned tree and Random Forest generalize well

Quick Reference

The Formula

R² = 1 - (SS_res / SS_tot)
   = 1 - (Σ(y - ŷ)² / Σ(y - ȳ)²)
   = 1 - (Your errors / Baseline errors)

Interpretation Scale

R² Value | Meaning   | Action
---------|-----------|------------------------------
1.0      | Perfect   | Suspicious! Check for leakage
0.9+     | Excellent | Great model
0.7-0.9  | Good      | Solid performance
0.5-0.7  | Moderate  | Room for improvement
0.3-0.5  | Weak      | Need better features
0-0.3    | Poor      | Rethink approach
< 0      | Harmful   | Model is worse than baseline!

Negative R² Causes

Cause                   | Fix
------------------------|--------------------------------------
Overfitting             | Reduce model complexity, regularize
Wrong model type        | Try different algorithms
Bad features            | Feature engineering
Data issues             | Check for errors, outliers
Test distribution shift | Ensure train/test similarity

Key Takeaways

  1. R² = 1 - (Your Error / Baseline Error) — Measures improvement over predicting the mean

  2. R² = 0 means no better than guessing average — Model has no predictive power

  3. R² < 0 means WORSE than guessing average — Model is actively harmful

  4. Negative R² usually means overfitting or wrong model — Something is fundamentally broken

  5. High R² doesn't mean good predictions — Could have high variance data or systematic bias

  6. Use Adjusted R² when comparing models with different feature counts — Penalizes unnecessary complexity

  7. Don't compare R² across datasets — Variance differences make it meaningless

  8. Always check train AND test R² — Big gap = overfitting


The One-Sentence Summary

R-squared asks "Is your dart-throwing strategy better than just aiming at the center every time?" — if R² is positive you're doing something right, if it's zero you might as well guess the average, and if it's negative your "strategy" is somehow making things WORSE than not trying at all.


What's Next?

Now that you understand R-squared, you're ready for:

  • MSE, RMSE, MAE — Absolute error metrics for regression
  • MAPE — Percentage-based error metrics
  • Residual Analysis — Diagnosing model problems visually
  • Cross-Validation for Regression — Robust model evaluation

Follow me for the next article in this series!


Let's Connect!

If negative R² finally makes sense now, drop a heart!

Questions? Ask in the comments — I read and respond to every one.

What's the worst (most negative) R² you've ever seen? I once hit -47 due to a unit conversion bug. The model was predicting in cents while targets were in dollars! 😅


The difference between a model that explains 85% of housing price variance and one that's somehow WORSE than just guessing "$400,000" for every house? R-squared. When it goes negative, your sophisticated model is being outperformed by someone who doesn't even know what machine learning is.


Share this with someone who's never seen negative R². When they do, they'll know exactly what went wrong.

Happy regressing! 🎯
