In This Article
- Why MAPE Misleads
- The Four-Metric Framework
- Python Implementation
- Residual Analysis and the Ljung-Box Test
- Retrain vs. Recalibrate Decision Table
Why MAPE Misleads
Mean Absolute Percentage Error is the default metric for forecast evaluation in most business contexts. It is easy to explain: if your MAPE is 8%, your model is wrong by 8% on average. That simplicity is also its critical flaw.
MAPE is undefined when actuals are zero, which happens constantly in revenue series with seasonal gaps, new product launches, or promotional periods. More subtly, it penalizes over-forecasts more severely than under-forecasts by construction: for a non-negative series, an under-forecast can contribute at most 100% error (the case where the forecast is zero), while an over-forecast has no upper bound and can contribute 200% or more. This asymmetry means MAPE-optimized models systematically bias toward underestimating demand, a direction that is rarely operationally preferable.
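The asymmetry is easy to check by hand. The sketch below uses a hypothetical helper, `ape`, and made-up numbers; it only illustrates the arithmetic and is not taken from any library.

```python
def ape(actual: float, forecast: float) -> float:
    """Absolute percentage error for a single observation, in percent."""
    if actual == 0:
        raise ZeroDivisionError("APE is undefined when the actual is zero")
    return abs(actual - forecast) / abs(actual) * 100


actual = 100.0

# Under-forecasting a non-negative quantity: the worst case is a forecast of zero.
worst_under = ape(actual, 0.0)

# Over-forecasting has no ceiling: the error keeps growing with the forecast.
over_3x = ape(actual, 300.0)
over_10x = ape(actual, 1000.0)

print(worst_under, over_3x, over_10x)  # 100.0 200.0 900.0
```

An optimizer that minimizes this quantity sees far more downside in guessing high than in guessing low, which is where the downward bias comes from.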
The Core Problem
A model can have a low MAPE and still be useless in practice. If it is consistently wrong in the same direction, if its errors correlate with past errors, or if it performs worse than a naive benchmark, those failures are invisible in a single-metric MAPE report.
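A small synthetic case shows how this can happen. The series and the constant 3% under-forecast below are invented for illustration, and the no-change benchmark is the simplest possible one; treat it as a sketch rather than a realistic evaluation.

```python
import numpy as np

# Synthetic, steadily growing series (made-up numbers for illustration).
actuals = np.arange(100.0, 124.0)   # 100, 101, ..., 123
forecasts = 0.97 * actuals          # the model is always 3% too low

errors = actuals - forecasts
mape = np.mean(np.abs(errors) / np.abs(actuals)) * 100
mean_error = np.mean(errors)        # non-zero mean error signals bias

# No-change naive benchmark: predict each value with the previous one.
naive_mae = np.mean(np.abs(np.diff(actuals)))
mae_ratio = np.mean(np.abs(errors)) / naive_mae

print(f"MAPE: {mape:.2f}%")
print(f"Mean error (bias): {mean_error:.3f}")
print(f"Model MAE / naive MAE: {mae_ratio:.2f}")
```

The MAPE is 3%, which would pass most business targets, yet every error has the same sign and the model's mean absolute error is more than three times that of simply repeating the previous value.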
The Four-Metric Framework
A rigorous forecast evaluation requires at minimum four metrics, each measuring a different failure mode. Used together, they reveal whether a model is accurate in magnitude, unbiased, better than a naive baseline, and not systematically gaming a particular error measure.
| Metric | What It Measures | Key Property |
| --- | --- | --- |
| MAPE | Mean percentage error magnitude | Intuitive but unstable at low actuals |
| RMSE | Root mean squared error | Penalizes large errors; same units as the series |
| MASE | Mean absolute scaled error vs. seasonal naïve | Scale-free; MASE > 1.0 means worse than naïve |
| Theil's U | RMSE ratio vs. no-change naïve | U > 1.0 means model is worse than doing nothing |
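Written out, with $y_t$ the actual, $\hat{y}_t$ the forecast, $n$ the number of evaluated periods, and $m$ the seasonal period, the four metrics take roughly the following form. The notation is introduced here for illustration. Two caveats apply: the MASE denominator is shown over the evaluation series for simplicity, although it is often computed on training data, and Theil's U has several variants in the literature, of which this is the RMSE-ratio form described above.

```latex
\begin{aligned}
\text{MAPE} &= \frac{100}{n} \sum_{t=1}^{n} \left| \frac{y_t - \hat{y}_t}{y_t} \right| \\
\text{RMSE} &= \sqrt{\frac{1}{n} \sum_{t=1}^{n} (y_t - \hat{y}_t)^2} \\
\text{MASE} &= \frac{\frac{1}{n} \sum_{t=1}^{n} \left| y_t - \hat{y}_t \right|}{\frac{1}{n-m} \sum_{t=m+1}^{n} \left| y_t - y_{t-m} \right|} \\
U &= \frac{\text{RMSE}}{\sqrt{\frac{1}{n-1} \sum_{t=2}^{n} (y_t - y_{t-1})^2}}
\end{aligned}
```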
Python Implementation
The function below computes all four metrics from actuals and forecasts arrays. MASE uses a seasonal naïve benchmark with a configurable seasonal_period; for monthly data the default of 12 compares each actual to the value from the same month one year prior. When the series is no longer than one full season, it falls back to a one-step naïve benchmark. Note that the benchmark errors are computed on the supplied actuals themselves rather than on a separate training series.
import numpy as np
from typing import Dict


def compute_forecast_metrics(
    actuals: np.ndarray,
    forecasts: np.ndarray,
    seasonal_period: int = 12,
    epsilon: float = 1e-8,
) -> Dict[str, float]:
    """Return MAPE, RMSE, MASE, and Theil's U for paired actuals and forecasts."""
    actuals = np.asarray(actuals, dtype=float)
    forecasts = np.asarray(forecasts, dtype=float)
    if actuals.shape != forecasts.shape:
        raise ValueError("actuals and forecasts must have the same shape")
    if actuals.size < 2:
        raise ValueError("at least two observations are required")

    errors = actuals - forecasts
    abs_errors = np.abs(errors)

    # MAPE: skip near-zero actuals to avoid division instability
    mask = np.abs(actuals) > epsilon
    if mask.any():
        mape = np.mean(abs_errors[mask] / np.abs(actuals[mask])) * 100
    else:
        mape = float('nan')  # every actual is (near) zero, so MAPE is undefined

    # RMSE
    rmse = np.sqrt(np.mean(errors ** 2))

    # MASE: seasonal naïve benchmark, computed on the actuals themselves
    if len(actuals) > seasonal_period:
        naive_errors = np.abs(actuals[seasonal_period:] - actuals[:-seasonal_period])
    else:
        naive_errors = np.abs(np.diff(actuals))  # one-step naïve fallback
    naive_mae = np.mean(naive_errors)
    mase = np.mean(abs_errors) / (naive_mae + epsilon)

    # Theil's U: compare model RMSE to no-change naïve RMSE
    naive_rmse = np.sqrt(np.mean((actuals[1:] - actuals[:-1]) ** 2))
    theil_u = rmse / (naive_rmse + epsilon)

    return {
        'mape': round(float(mape), 4),
        'rmse': round(float(rmse), 4),
        'mase': round(float(mase), 4),
        'theil_u': round(float(theil_u), 4),
    }
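The function above scales MASE by naïve errors computed on the evaluation window. MASE is commonly defined with the scale taken from the in-sample (training) errors of the naïve method instead, which keeps the denominator independent of the test period. A minimal sketch of that variant follows; `mase_in_sample`, its parameters, and the monthly numbers are hypothetical and not part of the original code.

```python
import numpy as np


def mase_in_sample(
    train: np.ndarray,
    actuals: np.ndarray,
    forecasts: np.ndarray,
    seasonal_period: int = 12,
) -> float:
    """MASE scaled by the seasonal naive MAE on the training series."""
    train = np.asarray(train, dtype=float)
    actuals = np.asarray(actuals, dtype=float)
    forecasts = np.asarray(forecasts, dtype=float)
    if len(train) <= seasonal_period:
        raise ValueError("training series must be longer than seasonal_period")
    # In-sample seasonal naive errors: y_t - y_{t-m}
    scale = np.mean(np.abs(train[seasonal_period:] - train[:-seasonal_period]))
    if scale == 0:
        raise ValueError("naive benchmark has zero error; MASE is undefined")
    return float(np.mean(np.abs(actuals - forecasts)) / scale)


# Made-up example: two years of monthly history, three months of forecasts.
train = np.array([10, 12, 14, 13, 15, 18, 20, 19, 16, 14, 12, 11,
                  11, 13, 16, 14, 17, 20, 22, 21, 17, 15, 13, 12], dtype=float)
actuals = np.array([12.0, 14.0, 17.0])
forecasts = np.array([11.5, 14.5, 16.0])

score = mase_in_sample(train, actuals, forecasts)
print(round(score, 4))  # below 1.0: better than seasonal naive on this toy data
```

Which scaling to use is a design choice. The evaluation-window version needs no training data at scoring time, while the in-sample version avoids a denominator that shifts with every new test period.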
Residual Analysis and the Ljung-Box Test
A well-specified forecast model should produce residuals that are white noise: random, uncorrelated, and centered near zero. If residuals show autocorrelation — if this period's error predicts next period's error — the model is leaving systematic information on the table. That pattern is detectable and exploitable, which means the model is not doing its job.
The Ljung-Box test is the standard statistical tool for detecting residual autocorrelation. It tests the null hypothesis that the autocorrelations of the residuals are jointly zero up to lag k, which is what white noise would produce. A p-value below 0.05 rejects that hypothesis at the 5% level and is strong evidence that the model has structural problems that cannot be patched by recalibration alone.
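To see what the test statistic measures, it can help to compute it by hand once. The sketch below implements the Q statistic, n(n+2) times the sum over lags of the squared sample autocorrelation divided by (n - k), using only NumPy. It applies the statistic to two synthetic series and compares the result against 18.307, the 5% critical value of a chi-square distribution with 10 degrees of freedom. The helper name `ljung_box_q` and the simulated data are assumptions for illustration; in practice a library routine that also returns the p-value is the better choice.

```python
import numpy as np


def ljung_box_q(residuals: np.ndarray, lags: int = 10) -> float:
    """Ljung-Box Q statistic over lags 1..lags."""
    e = np.asarray(residuals, dtype=float)
    n = len(e)
    e = e - e.mean()
    denom = np.sum(e ** 2)
    total = 0.0
    for k in range(1, lags + 1):
        rho_k = np.sum(e[k:] * e[:-k]) / denom  # sample autocorrelation at lag k
        total += rho_k ** 2 / (n - k)
    return float(n * (n + 2) * total)


rng = np.random.default_rng(0)
noise = rng.normal(size=500)

# AR(1) errors: each error carries over 70% of the previous one.
ar1 = np.zeros(500)
for t in range(1, 500):
    ar1[t] = 0.7 * ar1[t - 1] + noise[t]

CHI2_95_DF10 = 18.307  # 95th percentile of chi-square with 10 degrees of freedom

q_noise = ljung_box_q(noise)
q_ar1 = ljung_box_q(ar1)
print("white noise Q:", round(q_noise, 2))
print("AR(1) Q:      ", round(q_ar1, 2), "reject:", q_ar1 > CHI2_95_DF10)
```

With these synthetic series the AR(1) statistic lands far above the critical value. The white-noise statistic will usually, though not always, fall below it, since a 5% test produces false alarms about one time in twenty.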
import numpy as np
from typing import Any, Dict
from statsmodels.stats.diagnostic import acorr_ljungbox


def residual_analysis(
    actuals: np.ndarray,
    forecasts: np.ndarray,
    lags: int = 10,
    significance: float = 0.05,
) -> Dict[str, Any]:
    """Test residuals for autocorrelation and bias, and suggest an action."""
    residuals = np.asarray(actuals, dtype=float) - np.asarray(forecasts, dtype=float)

    # Ljung-Box test at a single lag horizon; the result is a one-row DataFrame
    lb_result = acorr_ljungbox(residuals, lags=[lags], return_df=True)
    lb_stat = float(lb_result['lb_stat'].iloc[-1])
    lb_pval = float(lb_result['lb_pvalue'].iloc[-1])
    autocorrelated = lb_pval < significance

    residual_mean = float(np.mean(residuals))
    residual_std = float(np.std(residuals))
    max_abs_residual = float(np.max(np.abs(residuals)))

    # Heuristic bias flag: mean residual larger than half a standard deviation
    biased = abs(residual_mean) > residual_std * 0.5

    if autocorrelated and biased:
        diagnosis = "RETRAIN: systematic bias with autocorrelation"
    elif autocorrelated:
        diagnosis = "RETRAIN: autocorrelated residuals indicate model misspecification"
    elif biased:
        diagnosis = "RECALIBRATE: bias without autocorrelation"
    else:
        diagnosis = "PASS: residuals appear well-behaved"

    return {
        'ljung_box_stat': round(lb_stat, 4),
        'ljung_box_pvalue': round(lb_pval, 4),
        'autocorrelated': autocorrelated,
        'residual_mean': round(residual_mean, 4),
        'residual_std': round(residual_std, 4),
        'max_abs_residual': round(max_abs_residual, 4),
        'diagnosis': diagnosis,
    }
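For the RECALIBRATE case, one simple form of correction is to shift new forecasts by the mean residual observed over a recent window. The sketch below assumes a purely additive bias; `bias_corrected` and the sample numbers are hypothetical, and a multiplicative factor or an updated intercept may suit some models better.

```python
import numpy as np


def bias_corrected(
    recent_actuals: np.ndarray,
    recent_forecasts: np.ndarray,
    new_forecasts: np.ndarray,
) -> np.ndarray:
    """Shift new forecasts by the mean residual observed on a recent window."""
    residuals = (np.asarray(recent_actuals, dtype=float)
                 - np.asarray(recent_forecasts, dtype=float))
    offset = residuals.mean()  # positive when the model has been under-forecasting
    return np.asarray(new_forecasts, dtype=float) + offset


# Made-up window in which the model ran 4.5 units low on average.
recent_actuals = np.array([105.0, 98.0, 110.0, 103.0])
recent_forecasts = np.array([100.0, 95.0, 104.0, 99.0])

adjusted = bias_corrected(recent_actuals, recent_forecasts, np.array([101.0, 107.0]))
print(adjusted)  # [105.5 111.5]
```

This only makes sense when the residuals are not autocorrelated. If they are, a constant shift leaves the underlying pattern in place.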
Retrain vs. Recalibrate Decision Table
Not every model failure requires a full retrain. Retraining means rebuilding the model from scratch on a new or expanded dataset — a significant undertaking for complex models. Recalibration means adjusting existing parameters, updating intercepts, or applying a bias correction factor. Knowing which intervention is appropriate requires reading the diagnostic signals together.
| Signal | Recommended Action | Rationale |
| --- | --- | --- |
| MASE > 1.0 | Retrain | Model underperforms a naïve baseline, indicating structural failure |
| Autocorrelated + bias | Retrain | Model is missing a systematic component; recalibration cannot fix this |
| Non-autocorrelated + bias | Recalibrate | Model structure is correct; apply bias correction or update intercept |
| All metrics passing | Monitor | Continue scheduled evaluation; no intervention needed |
| Theil's U > 1.0 despite low MAPE | Retrain | Model exploits MAPE asymmetry; real-world performance is worse than reported |
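The decision rules above can be sketched as a small function. The name `recommend_action`, its parameters, and the order in which the checks run are assumptions made for illustration. In particular, it treats autocorrelation without bias as a retrain signal, matching the residual diagnosis earlier in the article rather than an explicit row above.

```python
def recommend_action(
    mase: float,
    theil_u: float,
    autocorrelated: bool,
    biased: bool,
) -> str:
    """Map diagnostic signals to an action, checking the most severe first."""
    if mase > 1.0 or theil_u > 1.0:
        return "retrain"      # worse than a naive benchmark
    if autocorrelated:
        return "retrain"      # systematic structure left in the residuals
    if biased:
        return "recalibrate"  # structure looks sound; correct the offset
    return "monitor"


print(recommend_action(mase=0.8, theil_u=0.9, autocorrelated=False, biased=True))
# recalibrate
```

Checking the benchmark comparisons first means a model that loses to a naïve forecast is never sent for recalibration, however clean its residuals look.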
"A forecast model that passes its MAPE target while underperforming a naïve benchmark is not a model that works — it is a model that has learned to game a poorly chosen metric."
This post was originally published on White Oak Intelligence. Read the full article there for formatted diagrams, code examples, and related content.