Variance Testing in Forecasting

Why MAPE Misleads

Mean Absolute Percentage Error is the default metric for forecast evaluation in most business contexts. It is easy to explain: if your MAPE is 8%, your model is wrong by 8% on average. That simplicity is also its critical flaw.

MAPE is undefined when actuals are zero, which happens constantly in revenue series with seasonal gaps, new product launches, or promotional periods. More subtly, it penalizes over-forecasts more severely than under-forecasts by construction: as long as forecasts are non-negative, an under-forecast can contribute at most 100% error (a forecast of zero), while an over-forecast is unbounded and can contribute 200% or more. This asymmetry means models selected or tuned on MAPE are biased toward underestimating demand, a direction that is rarely operationally preferable.
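
The asymmetry is easy to see on a single observation. The snippet below is a minimal illustration, not part of the evaluation code; the helper name `ape` and the sample values are assumptions chosen for the example.

```python
def ape(actual: float, forecast: float) -> float:
    """Absolute percentage error for one observation, in percent."""
    return abs(actual - forecast) * 100 / abs(actual)

actual = 100.0

# Under-forecasts: a non-negative forecast can never miss by more than 100%.
under = [ape(actual, f) for f in (50.0, 10.0, 0.0)]

# Over-forecasts: the error keeps growing with the forecast.
over = [ape(actual, f) for f in (150.0, 300.0, 1000.0)]

print("under-forecast errors:", under)  # 50, 90, 100
print("over-forecast errors: ", over)   # 50, 200, 900
```

A model tuned to minimize the average of these values has more to lose from a large over-forecast than from a large under-forecast, so it drifts low.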

The Core Problem

A model can have a low MAPE and still be useless in practice. If it is consistently wrong in the same direction, if its errors correlate with past errors, or if it performs worse than a naive benchmark, those failures are invisible in a single-metric MAPE report.
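
As a small illustration of how these failures hide, the sketch below builds two hypothetical forecasts (`model_a` and `model_b`, both invented for this example) with identical MAPE. One is always low, the other alternates, and both are worse than simply repeating the previous actual.

```python
import numpy as np

actuals = np.array([100.0, 102.0, 101.0, 103.0, 102.0, 104.0])

# Hypothetical model A: always 5% low.
# Hypothetical model B: 5% off, alternating between low and high.
model_a = actuals * 0.95
model_b = actuals * np.array([0.95, 1.05, 0.95, 1.05, 0.95, 1.05])

def mape(a: np.ndarray, f: np.ndarray) -> float:
    return float(np.mean(np.abs(a - f) / np.abs(a)) * 100)

def mean_error(a: np.ndarray, f: np.ndarray) -> float:
    # Positive values mean the model under-forecasts on average.
    return float(np.mean(a - f))

# No-change naive forecast: predict the previous actual
# (the first point has no forecast, so it is dropped).
naive_mape = mape(actuals[1:], actuals[:-1])

print("MAPE A / B:", mape(actuals, model_a), mape(actuals, model_b))
print("Mean error A / B:", mean_error(actuals, model_a), mean_error(actuals, model_b))
print("Naive MAPE:", naive_mape)
```

Both models report the same MAPE, yet only the mean error exposes model A's one-sided bias, and only the naive comparison shows that neither model earns its keep on this series.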

The Four-Metric Framework

A rigorous forecast evaluation requires at minimum four metrics, each measuring a different failure mode. Used together, they reveal whether a model is accurate in magnitude, unbiased, better than a naive baseline, and not systematically gaming a particular error measure.

| Metric | What It Measures | Key Property |
|---|---|---|
| MAPE | Mean percentage error magnitude | Intuitive but unstable at low actuals |
| RMSE | Root mean squared error | Penalizes large errors; same units as the series |
| MASE | Mean absolute scaled error vs. seasonal naïve | Scale-free; MASE > 1.0 means worse than naïve |
| Theil's U | RMSE ratio vs. no-change naïve | U > 1.0 means model is worse than doing nothing |

Python Implementation

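The function below computes all four metrics from actuals and forecasts arrays. MASE divides the model's mean absolute error by the mean absolute error of a seasonal naïve benchmark with a configurable seasonal_period; for monthly data the default of 12 means the benchmark predicts each value with the value from the same month one year prior. When the series is no longer than one full season, it falls back to a one-step naïve benchmark. Note that the benchmark error here is computed on the evaluation actuals themselves rather than on a separate training series.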

import numpy as np
from typing import Dict

def compute_forecast_metrics(
    actuals: np.ndarray,
    forecasts: np.ndarray,
    seasonal_period: int = 12,
    epsilon: float = 1e-8
) -> Dict[str, float]:

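    """Compute MAPE, RMSE, MASE, and Theil's U for aligned actuals and forecasts."""
    # Coerce to float arrays so lists and integer dtypes behave as expected
    actuals = np.asarray(actuals, dtype=float)
    forecasts = np.asarray(forecasts, dtype=float)
    if actuals.shape != forecasts.shape:
        raise ValueError("actuals and forecasts must have the same shape")

    errors = actuals - forecasts
    abs_errors = np.abs(errors)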

    # MAPE — skip near-zero actuals to avoid division instability
    mask = np.abs(actuals) > epsilon
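    # Return NaN rather than averaging an empty slice when every actual is near zero
    mape = np.mean(abs_errors[mask] / np.abs(actuals[mask])) * 100 if mask.any() else float('nan')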

    # RMSE
    rmse = np.sqrt(np.mean(errors ** 2))

    # MASE — seasonal naïve benchmark
    if len(actuals) > seasonal_period:
        naive_errors = np.abs(actuals[seasonal_period:] - actuals[:-seasonal_period])
    else:
        naive_errors = np.abs(np.diff(actuals))  # one-step naïve fallback

    naive_mae = np.mean(naive_errors)
    mase = np.mean(abs_errors) / (naive_mae + epsilon)

    # Theil's U — compare model RMSE to no-change naïve RMSE
    naive_rmse = np.sqrt(np.mean((actuals[1:] - actuals[:-1]) ** 2))
    theil_u = rmse / (naive_rmse + epsilon)

    return {
        'mape':    round(float(mape), 4),
        'rmse':    round(float(rmse), 4),
        'mase':    round(float(mase), 4),
        'theil_u': round(float(theil_u), 4),
    }
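
The more common convention for MASE scales by the naïve error on the training series instead of the evaluation window. The variant below is a minimal sketch of that convention, not a replacement for the function above; the name `mase_with_training_scale` and the toy quarterly data are assumptions made for illustration.

```python
import numpy as np

def mase_with_training_scale(train, actuals, forecasts, seasonal_period=12):
    """MASE scaled by the in-sample seasonal naive MAE (hypothetical helper)."""
    train = np.asarray(train, dtype=float)
    actuals = np.asarray(actuals, dtype=float)
    forecasts = np.asarray(forecasts, dtype=float)
    if len(train) <= seasonal_period:
        raise ValueError("training series must be longer than one seasonal period")

    # Seasonal naive error on the training data: value minus same season last cycle
    scale = np.mean(np.abs(train[seasonal_period:] - train[:-seasonal_period]))
    if scale == 0:
        return float("nan")  # a perfectly repeating history gives no usable scale
    return float(np.mean(np.abs(actuals - forecasts)) / scale)

# Toy quarterly series: a fixed seasonal shape that rises by 4 each year
train = np.tile([10.0, 20.0, 30.0, 40.0], 3) + np.repeat([0.0, 4.0, 8.0], 4)
test_actuals = np.array([22.0, 32.0, 42.0, 52.0])
test_forecasts = np.array([20.0, 30.0, 44.0, 52.0])

score = mase_with_training_scale(train, test_actuals, test_forecasts, seasonal_period=4)
print(score)  # 0.375: model MAE of 1.5 divided by a training-set naive MAE of 4.0
```

Scaling by the training series keeps the denominator fixed across evaluation windows, which makes scores comparable from one evaluation run to the next.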

Residual Analysis and the Ljung-Box Test

A well-specified forecast model should produce residuals that are white noise: random, uncorrelated, and centered near zero. If residuals show autocorrelation — if this period's error predicts next period's error — the model is leaving systematic information on the table. That pattern is detectable and exploitable, which means the model is not doing its job.

The Ljung-Box test is the standard statistical tool for detecting residual autocorrelation. Its null hypothesis is that the residual autocorrelations at lags 1 through k are jointly zero, which is what white noise would produce. A p-value below 0.05 rejects that hypothesis and is strong evidence that the model is missing structure that recalibration alone cannot patch.
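
To make the statistic concrete, here is a NumPy-only sketch of the Ljung-Box Q statistic, Q = n(n+2) Σ ρ_k² / (n − k) summed over lags 1 to k. The helper name `ljung_box_q` and the simulated AR(1) residuals are assumptions for illustration. In practice a library routine is preferable, since it also returns the p-value.

```python
import numpy as np

def ljung_box_q(residuals, lags: int = 10) -> float:
    """Ljung-Box Q statistic for lags 1..lags (illustrative helper)."""
    x = np.asarray(residuals, dtype=float)
    n = len(x)
    if n <= lags:
        raise ValueError("need more observations than lags")
    x = x - x.mean()
    denom = np.sum(x ** 2)
    total = 0.0
    for k in range(1, lags + 1):
        rho_k = np.sum(x[k:] * x[:-k]) / denom  # sample autocorrelation at lag k
        total += rho_k ** 2 / (n - k)
    return float(n * (n + 2) * total)

# 95th percentile of the chi-square distribution with 10 degrees of freedom
CHI2_CRIT_10_DF = 18.307

rng = np.random.default_rng(0)
noise = rng.normal(size=300)

# Simulated residuals where each error carries over 80% of the previous one
ar1 = np.zeros(300)
for t in range(1, 300):
    ar1[t] = 0.8 * ar1[t - 1] + noise[t]

q_noise = ljung_box_q(noise)
q_ar1 = ljung_box_q(ar1)
print("white noise Q:", q_noise)
print("AR(1) Q:      ", q_ar1, "reject:", q_ar1 > CHI2_CRIT_10_DF)
```

Under the null hypothesis Q is approximately chi-square distributed with as many degrees of freedom as lags tested, so a value far above the critical threshold corresponds to a very small p-value.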

import numpy as np
from typing import Dict
from statsmodels.stats.diagnostic import acorr_ljungbox

def residual_analysis(
    actuals: np.ndarray,
    forecasts: np.ndarray,
    lags: int = 10,
    significance: float = 0.05
) -> Dict:

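    """Run a Ljung-Box test on the residuals and classify bias and autocorrelation."""
    residuals = np.asarray(actuals, dtype=float) - np.asarray(forecasts, dtype=float)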
    lb_result = acorr_ljungbox(residuals, lags=[lags], return_df=True)
    lb_stat  = float(lb_result['lb_stat'].iloc[-1])
    lb_pval  = float(lb_result['lb_pvalue'].iloc[-1])
    autocorrelated = lb_pval < significance

    residual_mean    = float(np.mean(residuals))
    residual_std     = float(np.std(residuals))
    max_abs_residual = float(np.max(np.abs(residuals)))

    if autocorrelated and abs(residual_mean) > residual_std * 0.5:
        diagnosis = "RETRAIN: systematic bias with autocorrelation"
    elif autocorrelated:
        diagnosis = "RETRAIN: autocorrelated residuals indicate model misspecification"
    elif abs(residual_mean) > residual_std * 0.5:
        diagnosis = "RECALIBRATE: bias without autocorrelation"
    else:
        diagnosis = "PASS: residuals appear well-behaved"

    return {
        'ljung_box_stat':   round(lb_stat, 4),
        'ljung_box_pvalue': round(lb_pval, 4),
        'autocorrelated':   autocorrelated,
        'residual_mean':    round(residual_mean, 4),
        'residual_std':     round(residual_std, 4),
        'max_abs_residual': round(max_abs_residual, 4),
        'diagnosis':        diagnosis,
    }

Retrain vs. Recalibrate Decision Table

Not every model failure requires a full retrain. Retraining means rebuilding the model from scratch on a new or expanded dataset — a significant undertaking for complex models. Recalibration means adjusting existing parameters, updating intercepts, or applying a bias correction factor. Knowing which intervention is appropriate requires reading the diagnostic signals together.
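
The simplest form of recalibration is an additive bias correction: estimate the mean residual on a recent calibration window and add it to new forecasts. The sketch below assumes that approach; the function name `bias_correct` and the sample numbers are illustrative, and a multiplicative factor or refitted intercept may suit some models better.

```python
import numpy as np

def bias_correct(forecasts, calib_actuals, calib_forecasts):
    """Shift forecasts by the mean residual of a calibration window (illustrative)."""
    calib_residuals = (
        np.asarray(calib_actuals, dtype=float) - np.asarray(calib_forecasts, dtype=float)
    )
    bias = float(np.mean(calib_residuals))  # positive: model has been forecasting low
    corrected = np.asarray(forecasts, dtype=float) + bias
    return corrected, bias

# Calibration window where the model was low by 5, 6, and 4 units
calib_actuals = [100.0, 110.0, 120.0]
calib_forecasts = [95.0, 104.0, 116.0]

corrected, bias = bias_correct([125.0, 130.0], calib_actuals, calib_forecasts)
print(bias)       # 5.0
print(corrected)  # [130. 135.]
```

This only helps when the residuals are biased but otherwise uncorrelated. If the errors also follow a pattern over time, a constant shift leaves that pattern in place.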

| Signal | Recommended Action | Rationale |
|---|---|---|
| MASE > 1.0 | Retrain | Model underperforms a naïve baseline: structural failure |
| Autocorrelated + bias | Retrain | Model is missing a systematic component; recalibration cannot fix this |
| Non-autocorrelated + bias | Recalibrate | Model structure is correct; apply bias correction or update intercept |
| All metrics passing | Monitor | Continue scheduled evaluation; no intervention needed |
| Theil's U > 1.0 despite low MAPE | Retrain | Model exploits MAPE asymmetry; real-world performance is worse than reported |
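
The table can be encoded as a small rule function so the decision is applied consistently. This is one possible sketch: the name `recommend_action`, its parameters, and the rule ordering are assumptions. It also treats autocorrelation without bias as a retrain signal, matching the diagnosis strings returned by residual_analysis.

```python
def recommend_action(
    mase: float,
    theil_u: float,
    autocorrelated: bool,
    biased: bool,
) -> str:
    """Map diagnostic signals to an intervention (illustrative rule order)."""
    # Benchmark failures come first: a model that loses to a naive forecast
    # needs rebuilding regardless of how its residuals look.
    if mase > 1.0 or theil_u > 1.0:
        return "retrain"
    # Autocorrelated residuals point to missing structure.
    if autocorrelated:
        return "retrain"
    # Bias alone can be handled with a correction or intercept update.
    if biased:
        return "recalibrate"
    return "monitor"

print(recommend_action(mase=1.2, theil_u=0.9, autocorrelated=False, biased=False))  # retrain
print(recommend_action(mase=0.8, theil_u=0.9, autocorrelated=False, biased=True))   # recalibrate
print(recommend_action(mase=0.8, theil_u=0.9, autocorrelated=False, biased=False))  # monitor
```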

"A forecast model that passes its MAPE target while underperforming a naïve benchmark is not a model that works — it is a model that has learned to game a poorly chosen metric."

This post was originally published on White Oak Intelligence.