Sachin Kr. Rajput

R-Squared Explained: The Dart Player Who's Somehow WORSE Than Just Aiming at the Center

The One-Line Summary: R-squared measures how much better your model is than just predicting the average every time. R² = 1 means perfect predictions. R² = 0 means you're no better than guessing the average. R² < 0 means you're somehow WORSE than guessing the average — your model is actively harmful.


The Darts Tournament of Prediction

Three contestants enter the "Predict the Target" darts championship.

The rules are simple: A target number appears (like 73, 45, 91, 28...). You throw a dart at a number line. Closest to the target wins.

Over 100 rounds, the targets ranged from 0 to 100, with an average of 50.


Player 1: "The Baseline" Barry

Barry's strategy: Always throw at 50 (the average).

Round 1: Target = 73    Barry throws: 50    Error: 23
Round 2: Target = 45    Barry throws: 50    Error: 5
Round 3: Target = 91    Barry throws: 50    Error: 41
Round 4: Target = 28    Barry throws: 50    Error: 22
...

Barry's total squared error: 25,000

Barry never tries to predict. He just throws at the center every time. Boring, but consistent.


Player 2: "The Predictor" Paula

Paula studies patterns. She notices the targets follow a pattern based on the time of day, previous numbers, and moon phase (okay, maybe not moon phase).

Round 1: Target = 73    Paula throws: 70    Error: 3
Round 2: Target = 45    Paula throws: 48    Error: 3
Round 3: Target = 91    Paula throws: 85    Error: 6
Round 4: Target = 28    Paula throws: 35    Error: 7
...

Paula's total squared error: 3,750

Paula's errors are MUCH smaller. Her "model" (pattern recognition) works!


Player 3: "The Overthinker" Oliver

Oliver built a complex system with 47 variables, three neural networks, and a ouija board. He's CERTAIN it's superior.

Round 1: Target = 73    Oliver throws: 12    Error: 61
Round 2: Target = 45    Oliver throws: 88    Error: 43
Round 3: Target = 91    Oliver throws: 15    Error: 76
Round 4: Target = 28    Oliver throws: 95    Error: 67
...

Oliver's total squared error: 42,000

Oliver's "sophisticated" system is a DISASTER. He's not just wrong — he's wronger than Barry who doesn't even try!


The Scorecard: R-Squared

R² = 1 - (Your Error / Baseline Error)

Barry (Baseline):
R² = 1 - (25,000 / 25,000) = 1 - 1 = 0
"Exactly as good as guessing the average"

Paula (The Predictor):
R² = 1 - (3,750 / 25,000) = 1 - 0.15 = 0.85
"85% better than guessing the average!"

Oliver (The Overthinker):
R² = 1 - (42,000 / 25,000) = 1 - 1.68 = -0.68
"68% WORSE than guessing the average!"

This is R-squared.

It answers one question: "Is your model better than just predicting the average?"

  • R² = 1: Perfect predictions
  • R² = 0.85: You explain 85% of the variance
  • R² = 0: Your model equals the "just guess average" baseline
  • R² = -0.68: Your model is WORSE than guessing the average
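To see the scorecard as code, here's a minimal sketch that turns the tournament's squared-error totals into R² values (plain Python, nothing else needed):

baseline_error = 25_000  # Barry's total squared error (always throw at the mean)

players = {
    "Barry (Baseline)": 25_000,
    "Paula (Predictor)": 3_750,
    "Oliver (Overthinker)": 42_000,
}

for name, total_squared_error in players.items():
    # R² = 1 - (your squared error / baseline squared error)
    r2 = 1 - total_squared_error / baseline_error
    print(f"{name}: R² = {r2:.2f}")

# Barry (Baseline): R² = 0.00
# Paula (Predictor): R² = 0.85
# Oliver (Overthinker): R² = -0.68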

The Formal Definition

R² = 1 - (SS_res / SS_tot)

Where:
  SS_res = Σ(actual - predicted)²    ← Your model's squared errors
  SS_tot = Σ(actual - mean)²         ← Baseline's squared errors (variance)

In plain English:

R² = 1 - (How wrong YOU are / How wrong the AVERAGE would be)

Visual Intuition

BASELINE MODEL (Predict Mean = 50 for everything):

Value
100 │              ●
    │         ●
 75 │    ●              ●
    │
 50 │════════════════════════  ← Baseline: Always predict 50
    │              ●
 25 │       ●
    │                    ●
  0 └─────────────────────────

    Errors = distances from all points to line at 50


YOUR MODEL (Actually predicts):

Value
100 │              ●
    │         ●  ╱
 75 │    ●    ╱        ●
    │      ╱      ╱
 50 │    ╱    ╱
    │  ╱   ╱       ●
 25 │ ╱  ●
    │╱               ●
  0 └─────────────────────────

    Errors = distances from points to your prediction line


R² asks: Are YOUR errors smaller than BASELINE errors?

Computing R-Squared in Python

import numpy as np
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Generate some data
np.random.seed(42)
X = np.random.randn(100, 3)
y = 3*X[:, 0] + 2*X[:, 1] - X[:, 2] + np.random.randn(100) * 0.5

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Fit model
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Calculate R²
r2 = r2_score(y_test, y_pred)
print(f"R² Score: {r2:.4f}")

# Manual calculation to understand
mean_y = np.mean(y_test)
ss_tot = np.sum((y_test - mean_y) ** 2)      # Baseline error
ss_res = np.sum((y_test - y_pred) ** 2)      # Model error
r2_manual = 1 - (ss_res / ss_tot)
print(f"R² (manual): {r2_manual:.4f}")

Output:

R² Score: 0.9512
R² (manual): 0.9512

Interpretation: The model explains 95.12% of the variance in y. Only 4.88% is unexplained (noise, missing features, etc.).


The R-Squared Scale

R² = 1.0    Perfect predictions (suspicious - probably overfitting!)
    │
R² = 0.9    Excellent - explains 90% of variance
    │
R² = 0.7    Good - explains 70% of variance  
    │
R² = 0.5    Moderate - explains 50% of variance
    │
R² = 0.3    Weak - only explains 30% of variance
    │
R² = 0.0    Useless - no better than predicting the mean
    │
R² = -0.5   HARMFUL - 50% worse than predicting the mean!
    │
R² → -∞     Catastrophically bad - your model is sabotage
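If you want that scale as a quick helper, here's a small sketch; the cutoffs are the rough, domain-dependent labels from the scale above, not an official standard:

def describe_r2(r2: float) -> str:
    """Map an R² value to the rough labels from the scale above."""
    if r2 < 0:
        return "Harmful - worse than predicting the mean"
    if r2 < 0.3:
        return "Poor - barely better than predicting the mean"
    if r2 < 0.5:
        return "Weak"
    if r2 < 0.7:
        return "Moderate"
    if r2 < 0.9:
        return "Good"
    if r2 < 1.0:
        return "Excellent"
    return "Perfect - suspicious, check for leakage or overfitting"

print(describe_r2(0.85))   # Good
print(describe_r2(-0.68))  # Harmful - worse than predicting the mean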

What Does NEGATIVE R-Squared Mean?

This is where people get confused. How can you be WORSE than guessing the average?

It happens when your model's predictions are so bad that you'd be better off ignoring it entirely.

Scenario 1: Overfitting on Training Data

from sklearn.tree import DecisionTreeRegressor

# Overfit a tree (no max_depth = memorize training data)
overfit_model = DecisionTreeRegressor(max_depth=None)
overfit_model.fit(X_train, y_train)

# Training R² - looks amazing!
train_pred = overfit_model.predict(X_train)
train_r2 = r2_score(y_train, train_pred)
print(f"Training R²: {train_r2:.4f}")  # 1.0 - perfect!

# Test R² - disaster!
test_pred = overfit_model.predict(X_test)
test_r2 = r2_score(y_test, test_pred)
print(f"Test R²: {test_r2:.4f}")  # Could be negative!

Output:

Training R²: 1.0000
Test R²: 0.2341  (or worse, could be negative!)

The model memorized training data patterns that don't generalize.


Scenario 2: Wrong Model for the Data

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Data with a clear NON-LINEAR pattern
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = np.sin(X).ravel() * 10 + np.random.randn(100) * 0.5

# Fit a LINEAR model to NON-LINEAR data.
# (fit_intercept=False so the misfit shows up in-sample: a least-squares line
# WITH an intercept can never score below zero on its own training data,
# because it contains "predict the mean" as a special case.)
linear_model = LinearRegression(fit_intercept=False)
linear_model.fit(X, y)
y_pred = linear_model.predict(X)

r2 = r2_score(y, y_pred)
print(f"R² for linear model on sine wave: {r2:.4f}")

Output:

R² for linear model on sine wave: -0.0234

Negative! A straight line through a sine wave misses the pattern so badly that it's WORSE than just guessing the average. (One subtlety: an ordinary least-squares line with an intercept can never score below zero on its own training data, because it can always fall back to predicting the mean; that's why this example drops the intercept. On held-out data, even an intercept-fitted line can go negative.)

  y
 10│    ╭──╮        ╭──╮
   │  ╭─╯  ╰─╮    ╭─╯  ╰─╮
  0│──╯      ╰────╯      ╰───  ← Sine wave data
   │         _______________
   │        ╱               ╲  ← Linear model (wrong!)
-10│      ╱                   ╲
   └────────────────────────────

The line misses the pattern entirely!
Mean prediction would be closer to most points.

Scenario 3: Predicting the Wrong Thing

# Predicting house price, but model learned to predict... something else?
# Buggy feature engineering, a wrong target variable, mismatched units, etc.

y_true = [300000, 450000, 275000, 500000, 350000]
y_pred = [100, 200, 150, 250, 175]  # Model predicting in wrong units!

r2 = r2_score(y_true, y_pred)
print(f"R²: {r2:.4f}")

Output:

R²: -18.7315

Massively negative! The predictions aren't even in the same universe as the targets.


Why Negative R² Happens: The Math

R² = 1 - (SS_res / SS_tot)

For R² to be negative:
  SS_res / SS_tot > 1
  SS_res > SS_tot

Meaning:
  Your model's squared errors > Baseline's squared errors
  Your errors > "Just guess mean" errors

You're ADDING error compared to doing nothing!

Visual:

Baseline errors (guess mean):         Your model errors:

    ●                                     ●
    │←────→│                              │←──────────────────→│
    ●      │                              ●                    │
    │←──→│ │                              │←────────→│         │
         Mean                                   Your bad prediction

If your arrows are LONGER than baseline arrows, R² goes negative!
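Here's a minimal three-point check of that inequality (the numbers are made up purely for illustration): the moment SS_res exceeds SS_tot, R² drops below zero.

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([10.0, 20.0, 30.0])
y_pred = np.array([30.0, 0.0, 50.0])   # every prediction is 20 off

ss_res = np.sum((y_true - y_pred) ** 2)         # your model's squared errors: 1200
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # baseline's squared errors: 200

print(ss_res > ss_tot)             # True -> R² must be negative
print(1 - ss_res / ss_tot)         # -5.0
print(r2_score(y_true, y_pred))    # -5.0, sklearn agrees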

R² vs Other Regression Metrics

import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

y_true = np.array([100, 150, 200, 250, 300])

# Good model
y_good = np.array([105, 145, 210, 245, 295])

# Bad model (worse than mean)
y_bad = np.array([300, 100, 300, 100, 300])

# Mean prediction (baseline)
y_mean = np.array([200, 200, 200, 200, 200])

print("Good Model:")
print(f"  R²:   {r2_score(y_true, y_good):.4f}")
print(f"  MSE:  {mean_squared_error(y_true, y_good):.2f}")
print(f"  MAE:  {mean_absolute_error(y_true, y_good):.2f}")

print("\nBad Model:")
print(f"  R²:   {r2_score(y_true, y_bad):.4f}")
print(f"  MSE:  {mean_squared_error(y_true, y_bad):.2f}")
print(f"  MAE:  {mean_absolute_error(y_true, y_bad):.2f}")

print("\nBaseline (Always Predict Mean):")
print(f"  R²:   {r2_score(y_true, y_mean):.4f}")
print(f"  MSE:  {mean_squared_error(y_true, y_mean):.2f}")
print(f"  MAE:  {mean_absolute_error(y_true, y_mean):.2f}")

Output:

Good Model:
  R²:   0.9920
  MSE:  40.00
  MAE:  6.00

Bad Model:
  R²:   -2.0000
  MSE:  15000.00
  MAE:  100.00

Baseline (Always Predict Mean):
  R²:   0.0000
  MSE:  5000.00
  MAE:  60.00

Key insights:

Metric | What it tells you
-------|----------------------------------
R²     | How good vs. baseline (relative)
MSE    | Average squared error magnitude
MAE    | Average absolute error magnitude

R² is relative — it compares to baseline.
MSE/MAE are absolute — raw error magnitudes.


Adjusted R-Squared: The Honest Version

Problem: training R² never decreases when you add more features to a linear model, even useless ones!

from sklearn.metrics import r2_score
import numpy as np

# True relationship: y = x1 + noise
np.random.seed(42)
n = 100
X1 = np.random.randn(n)
y = X1 + np.random.randn(n) * 0.5

# Add useless random features
X_noise = np.random.randn(n, 10)
X_all = np.column_stack([X1, X_noise])

# R² keeps increasing with more (useless) features!
from sklearn.linear_model import LinearRegression

print("Features | R² (train)")
print("-" * 25)
for n_features in [1, 3, 5, 7, 11]:
    X_subset = X_all[:, :n_features]
    model = LinearRegression().fit(X_subset, y)
    r2 = r2_score(y, model.predict(X_subset))
    print(f"    {n_features}    |   {r2:.4f}")

Output:

Features | R² (train)
-------------------------
    1    |   0.7823
    3    |   0.7891
    5    |   0.7954
    7    |   0.8012
    11   |   0.8134

R² went UP even though we added garbage features!


Solution: Adjusted R²

Adjusted R² penalizes adding features:

Adjusted R² = 1 - (1 - R²) × (n - 1) / (n - p - 1)

Where:
  n = number of samples
  p = number of features
def adjusted_r2(r2, n, p):
    """Calculate adjusted R-squared."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

n = 100
print("\nFeatures | R² (train) | Adjusted R²")
print("-" * 40)
for n_features in [1, 3, 5, 7, 11]:
    X_subset = X_all[:, :n_features]
    model = LinearRegression().fit(X_subset, y)
    r2 = r2_score(y, model.predict(X_subset))
    adj_r2 = adjusted_r2(r2, n, n_features)
    print(f"    {n_features}     |   {r2:.4f}    |    {adj_r2:.4f}")

Output:

Features | R² (train) | Adjusted R²
----------------------------------------
    1     |   0.7823    |    0.7801
    3     |   0.7891    |    0.7825
    5     |   0.7954    |    0.7845
    7     |   0.8012    |    0.7860
    11    |   0.8134    |    0.7900

Adjusted R² barely moved despite the extra features (and it can drop outright when a new feature adds nothing). It's the more honest number when models have different feature counts.


When R² Lies

Lie 1: High R² Doesn't Mean Good Predictions

# Near-perfect R², but the predictions can still be useless in practice
y_true = [1000000, 1000001, 1000002, 1000003, 1000004]
y_pred = [1000000.1, 1000001.2, 1000001.9, 1000003.1, 1000003.9]

r2 = r2_score(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)

print(f"R²: {r2:.4f}")  # Very high!
print(f"MAE: {mae:.2f}")  # Errors might still be unacceptable

Output:

R²: 0.9920
MAE: 0.12

High R², but if you're predicting dollars and need to be within $0.01, this model fails!


Lie 2: R² Depends on Data Variance

# Same model performance, different R² due to data variance

# Low variance data
y_true_low = [48, 49, 50, 51, 52]
y_pred_low = [47, 50, 51, 50, 53]

# High variance data  
y_true_high = [10, 30, 50, 70, 90]
y_pred_high = [9, 31, 51, 69, 91]

# Same absolute errors!
print(f"Low variance R²:  {r2_score(y_true_low, y_pred_low):.4f}")
print(f"High variance R²: {r2_score(y_true_high, y_pred_high):.4f}")

Output:

Low variance R²:  0.5000
High variance R²: 0.9988

Same prediction quality, wildly different R²! The high-variance data has more "easy variance" to explain.


Lie 3: R² Doesn't Clearly Flag Bias

# Model that's systematically off by a constant

y_true = [100, 200, 300, 400, 500]
y_biased = [150, 250, 350, 450, 550]  # Always +50 too high!

r2 = r2_score(y_true, y_biased)
print(f"R²: {r2:.4f}")  # Still looks great!
print(f"All predictions are $50 too high!")

Output:

R²: 0.8750
All predictions are $50 too high!

An R² of 0.875 still looks strong even though every prediction is $50 too high! R² rewards matching the pattern in the data; a constant offset only dents the score, so check the residuals or MAE whenever systematic bias matters.


When to Use R²

✅ Use R² When:

1. Comparing models on the same dataset

from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor

models = [LinearRegression(), Ridge(), Lasso(), RandomForestRegressor()]

for model in models:
    model.fit(X_train, y_train)
    r2 = r2_score(y_test, model.predict(X_test))
    print(f"{model.__class__.__name__}: R² = {r2:.4f}")

2. Quick sanity check — "Is my model doing anything?"

if r2 < 0:
    print("🚨 Model is WORSE than baseline! Something is wrong.")
elif r2 < 0.3:
    print("⚠️ Model is weak. Need better features or different approach.")
else:
    print("✓ Model has some predictive power.")

3. Communicating explained variance to stakeholders

"Our model explains 78% of the variation in house prices."


❌ Don't Use R² When:

1. Comparing across different datasets

# ❌ WRONG
"Model A on Dataset X: R² = 0.90"
"Model B on Dataset Y: R² = 0.75"
"Model A is better!"

# ✅ RIGHT
# Datasets have different inherent variance!
# Only compare models on the SAME data

2. When absolute error magnitude matters

# Use MSE, MAE, or RMSE instead
from sklearn.metrics import mean_absolute_error, mean_squared_error

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(f"On average, predictions are off by ${mae:.2f}")

3. When you need to detect bias

# Check for systematic over/under prediction
residuals = np.array(y_true) - np.array(y_pred)
mean_residual = np.mean(residuals)
print(f"Average residual: {mean_residual:.2f}")  # Should be ~0

Complete Example: Diagnosing Model Problems with R²

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Generate data with known pattern
np.random.seed(42)
n = 500
X = np.random.randn(n, 5)
y = 3*X[:, 0] + 2*X[:, 1]**2 - X[:, 2] + np.random.randn(n) * 2

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Test different models
models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree (overfit)': DecisionTreeRegressor(max_depth=None),
    'Decision Tree (tuned)': DecisionTreeRegressor(max_depth=5),
    'Random Forest': RandomForestRegressor(n_estimators=100, max_depth=10)
}

print("Model Diagnosis")
print("=" * 65)
print(f"{'Model':<25} {'Train R²':>12} {'Test R²':>12} {'Status':>15}")
print("-" * 65)

for name, model in models.items():
    model.fit(X_train, y_train)

    train_r2 = r2_score(y_train, model.predict(X_train))
    test_r2 = r2_score(y_test, model.predict(X_test))

    # Diagnose
    if train_r2 > 0.95 and test_r2 < train_r2 - 0.2:
        status = "⚠️ Overfit!"
    elif test_r2 < 0:
        status = "🚨 Harmful!"
    elif test_r2 < 0.3:
        status = "😐 Weak"
    elif test_r2 > 0.7:
        status = "✓ Good"
    else:
        status = "👍 Okay"

    print(f"{name:<25} {train_r2:>12.4f} {test_r2:>12.4f} {status:>15}")

Output:

Model Diagnosis
=================================================================
Model                         Train R²      Test R²          Status
-----------------------------------------------------------------
Linear Regression               0.6234       0.5891           👍 Okay
Decision Tree (overfit)         1.0000       0.4521        ⚠️ Overfit!
Decision Tree (tuned)           0.7823       0.7234           ✓ Good
Random Forest                   0.9456       0.8123           ✓ Good

Insights:

  • Linear Regression can't capture the squared X[:, 1] term (nonlinearity)
  • Deep Decision Tree memorizes training data (R²=1) but fails on test
  • Tuned tree and Random Forest generalize well

Quick Reference

The Formula

R² = 1 - (SS_res / SS_tot)
   = 1 - (Σ(y - ŷ)² / Σ(y - ȳ)²)
   = 1 - (Your errors / Baseline errors)

Interpretation Scale

R² Value | Meaning   | Action
---------|-----------|------------------------------
1.0      | Perfect   | Suspicious! Check for leakage
0.9+     | Excellent | Great model
0.7-0.9  | Good      | Solid performance
0.5-0.7  | Moderate  | Room for improvement
0.3-0.5  | Weak      | Need better features
0-0.3    | Poor      | Rethink approach
< 0      | Harmful   | Model is worse than baseline!

Negative R² Causes

Cause                   | Fix
------------------------|--------------------------------------
Overfitting             | Reduce model complexity, regularize
Wrong model type        | Try different algorithms
Bad features            | Feature engineering
Data issues             | Check for errors, outliers
Test distribution shift | Ensure train/test similarity

Key Takeaways

  1. R² = 1 - (Your Error / Baseline Error) — Measures improvement over predicting the mean

  2. R² = 0 means no better than guessing average — Model has no predictive power

  3. R² < 0 means WORSE than guessing average — Model is actively harmful

  4. Negative R² usually means overfitting or wrong model — Something is fundamentally broken

  5. High R² doesn't mean good predictions — Could have high variance data or systematic bias

  6. Use Adjusted R² when comparing models with different feature counts — Penalizes unnecessary complexity

  7. Don't compare R² across datasets — Variance differences make it meaningless

  8. Always check train AND test R² — Big gap = overfitting


The One-Sentence Summary

R-squared asks "Is your dart-throwing strategy better than just aiming at the center every time?" — if R² is positive you're doing something right, if it's zero you might as well guess the average, and if it's negative your "strategy" is somehow making things WORSE than not trying at all.


What's Next?

Now that you understand R-squared, you're ready for:

  • MSE, RMSE, MAE — Absolute error metrics for regression
  • MAPE — Percentage-based error metrics
  • Residual Analysis — Diagnosing model problems visually
  • Cross-Validation for Regression — Robust model evaluation

Follow me for the next article in this series!


Let's Connect!

If negative R² finally makes sense now, drop a heart!

Questions? Ask in the comments — I read and respond to every one.

What's the worst (most negative) R² you've ever seen? I once hit -47 due to a unit conversion bug. The model was predicting in cents while targets were in dollars! 😅


The difference between a model that explains 85% of housing price variance and one that's somehow WORSE than just guessing "$400,000" for every house? R-squared. When it goes negative, your sophisticated model is being outperformed by someone who doesn't even know what machine learning is.


Share this with someone who's never seen negative R². When they do, they'll know exactly what went wrong.

Happy regressing! 🎯
