Mubarak Mohamed
Preparing Your Data for the ARIMA Model: The Secret Step to Reliable Forecasts

Before making predictions, we need to make sure our data is ready.
A raw time series often contains trends or fluctuations that can mislead a forecasting model.

The ARIMA model has one key requirement: it only works properly with stationary series.

  • A stationary series is one whose statistical properties (mean, variance, autocorrelation) remain stable over time.
  • A non-stationary series, on the other hand, changes significantly over time (for example, with a strong trend or seasonality).

Without this preparation, ARIMA may produce biased or unreliable forecasts.

In the previous article (How Time Series Reveal the Future: An Introduction to the ARIMA Model), we explored what a time series is, its components (trend, seasonality, noise), and the intuition behind ARIMA.
We also visualized the AirPassengers dataset, which showed a steady upward trend and yearly seasonality.

👉 But for ARIMA to work, our data must satisfy one key condition: stationarity.
That’s exactly what this article is about: transforming a non-stationary series into a stationary one using simple techniques (differencing, statistical tests).
In other words: after observing, we now move on to preparing.

Simplified Theory

  1. What is stationarity?
     A stationary series is one whose statistical properties (mean, variance, autocorrelation) remain stable over time.
     👉 Example: daily winter temperatures in a city (small fluctuations around a stable mean).

A non-stationary series changes substantially over time, typically because of:

  • Trend (e.g., a constant increase in smartphone sales).
  • Seasonality (e.g., ice cream sales peaking every summer).

ARIMA assumes the series is stationary: otherwise, it “believes” past trends will continue forever, leading to biased forecasts.

  2. Differencing
     To make a series stationary, we use differencing:
     Y'_t = Y_t − Y_{t−1}
     In other words, each value is replaced by the change between two successive periods.
     • This removes linear trends.
     • For strong seasonality, we can apply seasonal differencing (e.g., subtract the value from one year earlier).

Example: instead of analyzing raw monthly sales, we analyze the month-to-month change.
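To make this concrete, here is a minimal sketch on synthetic monthly data (the trend and the 12-month cycle are made up for illustration): `.diff()` removes the linear trend, while `.diff(12)` subtracts the value from 12 months earlier.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series: linear trend + 12-month seasonal cycle
idx = pd.date_range("2020-01", periods=48, freq="MS")
trend = np.arange(48) * 2.0
season = 10 * np.sin(2 * np.pi * np.arange(48) / 12)
sales = pd.Series(trend + season, index=idx)

first_diff = sales.diff().dropna()       # removes the linear trend
seasonal_diff = sales.diff(12).dropna()  # removes the yearly cycle
```

Here the seasonal difference comes out perfectly flat because the synthetic cycle repeats exactly every 12 months; on real data it will only become approximately stable.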

  3. Statistical tests (ADF & KPSS)
     To check whether a series is stationary, we use two complementary tests.

     ADF (Augmented Dickey-Fuller test):
     • Null hypothesis (H₀): the series is non-stationary.
     • If p-value < 0.05 → reject H₀ → the series is stationary.

     KPSS (Kwiatkowski-Phillips-Schmidt-Shin test):
     • Null hypothesis (H₀): the series is stationary.
     • If p-value < 0.05 → reject H₀ → the series is non-stationary.

In practice:

  • We apply both tests for robustness.
  • If ADF and KPSS disagree, the series may be trend-stationary or difference-stationary, and we refine with an extra transformation (detrending or another difference).

Hands-on in Python

We’ll use a simple time series dataset: the annual flow of the Nile River (built into statsmodels).

import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller, kpss
from statsmodels.datasets import nile

# Load Nile dataset
data = nile.load_pandas().data
data.index = pd.date_range(start="1871", periods=len(data), freq="Y")
series = data['volume']

# Plot series
plt.figure(figsize=(10,4))
plt.plot(series)
plt.title("Annual Nile River Flow (1871–1970)")
plt.show()

Check stationarity with ADF & KPSS

def adf_test(series):
    result = adfuller(series, autolag='AIC')
    print("ADF Statistic:", result[0])
    print("p-value:", result[1])
    if result[1] < 0.05:
        print("✅ The series is stationary (ADF).")
    else:
        print("❌ The series is NON-stationary (ADF).")

def kpss_test(series):
    result = kpss(series, regression='c', nlags="auto")
    print("KPSS Statistic:", result[0])
    print("p-value:", result[1])
    if result[1] < 0.05:
        print("❌ The series is NON-stationary (KPSS).")
    else:
        print("✅ The series is stationary (KPSS).")

print("ADF Test")
adf_test(series)
print("\nKPSS Test")
kpss_test(series)

Apply differencing

series_diff = series.diff().dropna()

plt.figure(figsize=(10,4))
plt.plot(series_diff)
plt.title("Differenced series (1st difference)")
plt.show()

print("ADF Test after differencing")
adf_test(series_diff)
print("\nKPSS Test after differencing")
kpss_test(series_diff)

Understanding ARIMA and its parameters

The ARIMA(p, d, q) model combines three parts:

  1. AR (AutoRegressive, p)
     Uses past values to predict the future.
     Example: if p = 2, the current value depends on the last 2 values.
     Formula: Y_t = φ₁·Y_{t−1} + φ₂·Y_{t−2} + ε_t

  2. I (Integrated, d)
     The number of differences applied to make the series stationary.
     Example: d = 0 → no differencing; d = 1 → one difference applied.

  3. MA (Moving Average, q)
     Uses past errors (residuals) for prediction.
     Example: if q = 2, the prediction depends on the last two errors.
     Formula: Y_t = θ₁·ε_{t−1} + θ₂·ε_{t−2} + ε_t

In short:

  • p = memory of past values
  • d = degree of differencing
  • q = memory of past errors

Example in Python

from statsmodels.tsa.arima.model import ARIMA

# ARIMA(1,1,1)
model = ARIMA(series, order=(1,1,1))
fit = model.fit()

print(fit.summary())

Typical output:

  • AR (p) → past values effect
  • I (d) → differencing applied
  • MA (q) → past errors effect
  • AIC/BIC → model quality (lower = better)

Choosing the best parameters (p,d,q)

One of the main challenges with ARIMA is selecting the right p, d, q.

Choosing p and q with ACF & PACF

  • The ACF (autocorrelation function) helps choose q (the MA part).
  • The PACF (partial autocorrelation function) helps choose p (the AR part).

Simple rules:

  • A sharp cutoff in the PACF after lag p → good candidate for p.
  • A sharp cutoff in the ACF after lag q → good candidate for q.
