Before making predictions, we need to make sure our data is ready.
A raw time series often contains trends or fluctuations that can mislead a forecasting model.
The ARIMA model has one key requirement: it only works properly with stationary series.
- A stationary series is one whose statistical properties (mean, variance, autocorrelation) remain stable over time.
- A non-stationary series, on the other hand, changes significantly over time (for example, because of a strong trend or seasonality).
Without this preparation, ARIMA may produce biased or unreliable forecasts.
In the previous article (How Time Series Reveal the Future: An Introduction to the ARIMA Model), we explored what a time series is, its components (trend, seasonality, noise), and the intuition behind ARIMA.
We also visualized the AirPassengers dataset, which showed a steady upward trend and yearly seasonality.
👉 But for ARIMA to work, our data must satisfy one key condition: stationarity.
That’s exactly what this article is about: transforming a non-stationary series into a stationary one using simple techniques (differencing, statistical tests).
In other words: after observing, we now move on to preparing.
Simplified Theory
- **What is stationarity?** A stationary series is one whose statistical properties (mean, variance, autocorrelation) remain stable over time. 👉 Example: daily winter temperatures in a city (hovering around a stable mean with small fluctuations).
A non-stationary series changes too much over time:
- **Trend** (e.g., a constant increase in smartphone sales).
- **Seasonality** (e.g., ice cream sales peaking every summer).
ARIMA assumes the series is stationary: otherwise, it “believes” past trends will continue forever, leading to biased forecasts.
- **Differencing.** To make a series stationary, we use differencing: Y'_t = Y_t - Y_{t-1}. In other words, each value is replaced by the **change between two successive periods**.
- This removes linear trends.
- For strong seasonality, we can apply seasonal differencing (e.g., difference with the value one year before).
Example: instead of analyzing raw monthly sales, we analyze the month-to-month change.
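As a quick sketch (toy numbers, not the article's dataset), pandas implements both kinds of differencing directly:

```python
import pandas as pd

# Toy series of four values (hypothetical, for illustration only)
s = pd.Series([100, 112, 125, 141])

# First difference: each value minus the previous one -> 12, 13, 16
print(s.diff().dropna())

# For monthly data with yearly seasonality, s.diff(12) would subtract
# the value from 12 periods earlier (seasonal differencing).
```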
- **Statistical tests (ADF & KPSS).** To check whether a series is stationary, we use two complementary tests.
**ADF (Augmented Dickey-Fuller test)**
- Null hypothesis (H₀): the series is non-stationary (it has a unit root).
- If p-value < 0.05 → reject H₀ → the series is stationary.
**KPSS (Kwiatkowski-Phillips-Schmidt-Shin test)**
- Null hypothesis (H₀): the series is stationary.
- If p-value < 0.05 → reject H₀ → the series is non-stationary.
In practice:
- We apply both tests for robustness.
- If ADF and KPSS disagree, the combined reading below tells us which transformation to apply.
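Combining the two tests gives four possible outcomes. A common reading (as in the statsmodels documentation):
- Both say stationary → the series is stationary; no transformation needed.
- Both say non-stationary → the series is non-stationary; apply differencing.
- ADF says stationary, KPSS says non-stationary → the series is difference-stationary: differencing should fix it.
- ADF says non-stationary, KPSS says stationary → the series is trend-stationary: removing the trend (detrending) works better than differencing.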
Hands-on in Python
We’ll use a simple time series dataset: the annual flow of the Nile River (built into statsmodels).
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller, kpss
from statsmodels.datasets import nile
# Load Nile dataset
data = nile.load_pandas().data
data.index = pd.date_range(start="1871", periods=len(data), freq="Y")  # annual index; newer pandas prefers freq="YE"
series = data['volume']
# Plot series
plt.figure(figsize=(10,4))
plt.plot(series)
plt.title("Annual Nile River Flow (1871–1970)")
plt.show()
Check stationarity with ADF & KPSS
def adf_test(series):
    result = adfuller(series, autolag='AIC')
    print("ADF Statistic:", result[0])
    print("p-value:", result[1])
    if result[1] < 0.05:
        print("✅ The series is stationary (ADF).")
    else:
        print("❌ The series is NON-stationary (ADF).")

def kpss_test(series):
    result = kpss(series, regression='c', nlags="auto")
    print("KPSS Statistic:", result[0])
    print("p-value:", result[1])
    if result[1] < 0.05:
        print("❌ The series is NON-stationary (KPSS).")
    else:
        print("✅ The series is stationary (KPSS).")
print("ADF Test")
adf_test(series)
print("\nKPSS Test")
kpss_test(series)
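Note: depending on the series, `kpss` may raise an InterpolationWarning — the test statistic falls outside the range of its p-value lookup table, so the reported p-value is clipped to a boundary (0.01 or 0.1). The conclusion still holds; just read the p-value as "smaller than" or "greater than" that bound.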
Apply differencing
series_diff = series.diff().dropna()
plt.figure(figsize=(10,4))
plt.plot(series_diff)
plt.title("Differenced series (1st difference)")
plt.show()
print("ADF Test after differencing")
adf_test(series_diff)
print("\nKPSS Test after differencing")
kpss_test(series_diff)
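If both tests now agree that the differenced series is stationary, a single difference is enough — that is exactly the d = 1 we will pass to ARIMA in the next section.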
Understanding ARIMA and its parameters
The ARIMA(p, d, q) model combines three parts:
- **AR (AutoRegressive, p)**: uses past values to predict the future. Example: if p = 2, the current value depends on the last two values. Formula: Y_t = \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \epsilon_t
- **I (Integrated, d)**: the number of differences applied to make the series stationary. Example: d = 0 → no differencing; d = 1 → one difference applied.
- **MA (Moving Average, q)**: uses past errors (residuals) for prediction. Example: if q = 2, the prediction depends on the last two errors. Formula: Y_t = \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2} + \epsilon_t
In short:
- p = memory of past values
- d = degree of differencing
- q = memory of past errors
Example in Python
from statsmodels.tsa.arima.model import ARIMA
# ARIMA(1,1,1)
model = ARIMA(series, order=(1,1,1))
fit = model.fit()
print(fit.summary())
Typical output:
- AR (p) → past values effect
- I (d) → differencing applied
- MA (q) → past errors effect
- AIC/BIC → model quality (lower = better)
Choosing the best parameters (p,d,q)
One of the main challenges with ARIMA is selecting the right p, d, q.
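One simple, brute-force option (a sketch we add here, not part of the original walkthrough) is to fit several small candidate models and keep the one with the lowest AIC:

```python
import itertools
from statsmodels.tsa.arima.model import ARIMA

# Try small values of (p, d, q) and keep the model with the lowest AIC.
# The search ranges below are an assumption; widen them as needed.
best_aic, best_order = float("inf"), None
for p, d, q in itertools.product(range(3), range(2), range(3)):
    try:
        res = ARIMA(series, order=(p, d, q)).fit()
        if res.aic < best_aic:
            best_aic, best_order = res.aic, (p, d, q)
    except Exception:
        continue  # some combinations may fail to converge

print(f"Best order by AIC: {best_order} (AIC = {best_aic:.1f})")
```

AIC penalizes extra parameters, which guards against overfitting; for larger searches, libraries such as pmdarima automate this idea with auto_arima.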
Choosing p and q with ACF & PACF
- ACF → helps to choose q (MA part).
- PACF → helps to choose p (AR part).
Simple rules:
- PACF cutoff → good candidate for p.
- ACF cutoff → good candidate for q.
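To apply these rules in practice, plot both functions on the stationary (differenced) series — a minimal sketch assuming the `series_diff` computed in the hands-on section above:

```python
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.pyplot as plt

# ACF suggests q (MA order), PACF suggests p (AR order)
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
plot_acf(series_diff, lags=20, ax=axes[0])
plot_pacf(series_diff, lags=20, method="ywm", ax=axes[1])
plt.tight_layout()
plt.show()
```

Look for the lag after which the bars drop inside the confidence band: a sharp cutoff in the PACF suggests p, a sharp cutoff in the ACF suggests q.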