Mubarak Mohamed
Preparing Your Data for the ARIMA Model: The Secret Step to Reliable Forecasts

Before making predictions, we need to make sure our data is ready.
A raw time series often contains trends or fluctuations that can mislead a forecasting model.

The ARIMA model has one key requirement: it only works properly with stationary series.

  • A stationary series is one whose statistical properties (mean, variance, autocorrelation) remain stable over time.
  • A non-stationary series, on the other hand, changes significantly over time (for example, with a strong trend or seasonality).

Without this preparation, ARIMA may produce biased or unreliable forecasts.

In the previous article (How Time Series Reveal the Future: An Introduction to the ARIMA Model), we explored what a time series is, its components (trend, seasonality, noise), and the intuition behind ARIMA.
We also visualized the AirPassengers dataset, which showed a steady upward trend and yearly seasonality.

👉 But for ARIMA to work, our data must satisfy one key condition: stationarity.
That’s exactly what this article is about: transforming a non-stationary series into a stationary one using simple techniques (differencing, statistical tests).
In other words: after observing, we now move on to preparing.

Simplified Theory

  1. What is stationarity?
     A stationary series is one whose statistical properties (mean, variance, autocorrelation) remain stable over time.
     👉 Example: daily winter temperatures in a city (small fluctuations around a stable mean).

A non-stationary series changes substantially over time, typically because of:

  • Trend (e.g., a constant increase in smartphone sales).
  • Seasonality (e.g., ice cream sales peaking every summer).

ARIMA assumes the series is stationary: otherwise, it “believes” past trends will continue forever, leading to biased forecasts.

  2. Differencing
     To make a series stationary, we use differencing:
     Y'_t = Y_t − Y_{t−1}
     In other words, each value is replaced by the change between two successive periods.
     • This removes linear trends.
     • For strong seasonality, we can apply seasonal differencing (e.g., subtract the value from one year earlier).

Example: instead of analyzing raw monthly sales, we analyze the month-to-month change.
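To make this concrete, here is a minimal sketch on synthetic monthly data (the trend and the 12-month cycle are made up for illustration): `.diff()` removes the linear trend, while `.diff(12)` subtracts the value from 12 months earlier.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series: linear trend + 12-month seasonal cycle
idx = pd.date_range("2020-01", periods=48, freq="MS")
trend = np.arange(48) * 2.0
season = 10 * np.sin(2 * np.pi * np.arange(48) / 12)
sales = pd.Series(trend + season, index=idx)

first_diff = sales.diff().dropna()       # removes the linear trend
seasonal_diff = sales.diff(12).dropna()  # removes the yearly cycle
```

Here the seasonal difference comes out perfectly flat because the synthetic cycle repeats exactly every 12 months; on real data it will only become approximately stable.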

  3. Statistical tests (ADF & KPSS)
     To check whether a series is stationary, we use two complementary tests.

     ADF (Augmented Dickey-Fuller test):
     • Null hypothesis (H₀): the series is non-stationary.
     • If p-value < 0.05 → reject H₀ → the series is stationary.

     KPSS (Kwiatkowski-Phillips-Schmidt-Shin test):
     • Null hypothesis (H₀): the series is stationary.
     • If p-value < 0.05 → reject H₀ → the series is non-stationary.

In practice:

  • We apply both tests for robustness.
  • If ADF and KPSS disagree, the series may be trend-stationary or difference-stationary, and we refine with an extra transformation (detrending or another difference).

Hands-on in Python

We’ll use a simple time series dataset: the annual flow of the Nile River (built into statsmodels).

import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller, kpss
from statsmodels.datasets import nile

# Load Nile dataset
data = nile.load_pandas().data
data.index = pd.date_range(start="1871", periods=len(data), freq="Y")
series = data['volume']

# Plot series
plt.figure(figsize=(10,4))
plt.plot(series)
plt.title("Annual Nile River Flow (1871–1970)")
plt.show()

Check stationarity with ADF & KPSS

def adf_test(series):
    result = adfuller(series, autolag='AIC')
    print("ADF Statistic:", result[0])
    print("p-value:", result[1])
    if result[1] < 0.05:
        print("✅ The series is stationary (ADF).")
    else:
        print("❌ The series is NON-stationary (ADF).")

def kpss_test(series):
    result = kpss(series, regression='c', nlags="auto")
    print("KPSS Statistic:", result[0])
    print("p-value:", result[1])
    if result[1] < 0.05:
        print("❌ The series is NON-stationary (KPSS).")
    else:
        print("✅ The series is stationary (KPSS).")

print("ADF Test")
adf_test(series)
print("\nKPSS Test")
kpss_test(series)

Apply differencing

series_diff = series.diff().dropna()

plt.figure(figsize=(10,4))
plt.plot(series_diff)
plt.title("Differenced series (1st difference)")
plt.show()

print("ADF Test after differencing")
adf_test(series_diff)
print("\nKPSS Test after differencing")
kpss_test(series_diff)

Understanding ARIMA and its parameters

The ARIMA(p, d, q) model combines three parts:

  1. AR (AutoRegressive, p)
     Uses past values to predict the future.
     Example: if p = 2, the current value depends on the last 2 values.
     Formula: Y_t = φ₁·Y_{t−1} + φ₂·Y_{t−2} + ε_t

  2. I (Integrated, d)
     The number of differences applied to make the series stationary.
     Example: d = 0 → no differencing; d = 1 → one difference applied.

  3. MA (Moving Average, q)
     Uses past errors (residuals) for prediction.
     Example: if q = 2, the prediction depends on the last two errors.
     Formula: Y_t = θ₁·ε_{t−1} + θ₂·ε_{t−2} + ε_t

In short:

  • p = memory of past values
  • d = degree of differencing
  • q = memory of past errors

Example in Python

from statsmodels.tsa.arima.model import ARIMA

# ARIMA(1,1,1)
model = ARIMA(series, order=(1,1,1))
fit = model.fit()

print(fit.summary())

Typical output:

  • AR (p) → past values effect
  • I (d) → differencing applied
  • MA (q) → past errors effect
  • AIC/BIC → model quality (lower = better)

Choosing the best parameters (p,d,q)

One of the main challenges with ARIMA is selecting the right p, d, q.

Choosing p and q with ACF & PACF

  • The ACF (autocorrelation function) helps choose q (the MA part).
  • The PACF (partial autocorrelation function) helps choose p (the AR part).

Simple rules:

  • A sharp cutoff in the PACF after lag p → good candidate for p.
  • A sharp cutoff in the ACF after lag q → good candidate for q.
