Data Science For Cats : PART 6

Marjan Ferdousi

Fitting Data To Your Model: Time Series Analysis

(Data was taken and lightly edited from an open-source time series dataset.)

Kitty, do you remember that hooman told you about the components and stationarity (here) of a time series the other day? Of course you do, but you do not remember hooman telling you how to reliably detect whether a time series is stationary or not.

To check if our data is stationary, the first thing you are going to do is start believing that your data is NOT stationary, and keep wishing that someone proves you wrong. It’s like, whenever your hooman makes you wear a harness and puts you inside your travel bag, you immediately understand hooman is taking you to the vet, and you keep praying that you are wrong and that hooman is taking you somewhere else. Here, believing ‘data is non-stationary’ or ‘hooman is taking you to the vet’ is called the ‘NULL HYPOTHESIS’. On the other hand, the event of your data being stationary, or hooman taking you to a park or playground, is called the ‘ALTERNATIVE HYPOTHESIS’. Now you will have to check whether your null hypothesis holds or not.

First, you load the data from the CSV file.

import pandas as pd

df = pd.read_csv("likestuna - Sheet3.csv")
df.head()

[Image: output of df.head()]

Now hooman says he is going to ‘test’ the stationarity using a method named the ADF test, also known as the Augmented Dickey-Fuller test. Hoomans named Dickey and Fuller came up with this method, where they check something named a ‘p-value’. This p-value is like the judge of your ADF test. The higher its value is, the more it supports your null hypothesis. If the value is lower than 0.05, it speaks against your null hypothesis, which means whatever you believed at first is probably not true.

from statsmodels.tsa.stattools import adfuller

def adfuller_test(sales):
    # adfuller returns the test statistic, the p-value, the number of
    # lags used and the number of observations (plus critical values).
    result = adfuller(sales)
    labels = ['ADF Test Statistic', 'p-value', '#Lags Used', 'Number of Observations']
    for value, label in zip(result, labels):
        print(label + ' : ' + str(value))

    if result[1] <= 0.05:
        print("Strong evidence against the null hypothesis (H0): reject it. Data is stationary")
    else:
        print("Weak evidence against the null hypothesis (H0): fail to reject it. Data is non-stationary")

adfuller_test(df['Sales'])

Hooman gets the output:

[Image: output of the ADF test]

You see, your ADF test says your data is not stationary. What do you do now?

To make a series stationary, first try to remember when and why a time series becomes non-stationary. If you remember the diagrams hooman showed you in the 4th part of your data science journey, after you plotted the data there were repeated, similar-looking fragments that kept climbing, because the series had an upward trend. You can say that this trend is the reason your data sits at so many different levels over time.
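If you want to actually look at that trend on its own, you can split the series into its components again. Here is a minimal sketch, assuming the same df with a ‘Sales’ column of monthly values (the period=12 below is an assumption of yearly seasonality, not something hooman verified in this post):

from statsmodels.tsa.seasonal import seasonal_decompose
import matplotlib.pyplot as plt

# Split the series into trend + seasonality + residual.
# period=12 assumes monthly data with a yearly seasonal cycle.
decomposition = seasonal_decompose(df['Sales'], model='additive', period=12)
decomposition.plot()
plt.show()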

A nice first step towards stationarizing a time series is to shift it a bit and compare the shifted copy with the original series. By shifting, hooman means introducing a delay, and you will be calling this delay a ‘LAG’. For example, what if your data started after one month of delay?

[Image: the series alongside a one-month lagged copy]

You want to see how your data looks after shifts.

df['lag 1'] = df['Sales'].shift(1)  # one-month delay
df['lag 2'] = df['Sales'].shift(2)  # two-month delay
df.head()

[Image: output of df.head() with the new lag columns]

Hooman plots your data and a 5-month-delayed copy of it in the same graph to observe how it looks.

ax = df['Sales'].plot(color = 'b') 
df['Sales'].shift(5).plot(ax=ax, color = 'm')

[Image: original series in blue and 5-month shifted series in magenta]

Here, blue is your original data, and magenta is the data with a delay of 5 units (in your case, months).

Hooman says that if you take the difference between your data and the 1-unit lagged data, there is a good chance that it will become stationary. Now what is this difference?

[Image: illustration of the differencing]

df['Sales First Difference'] = df['Sales'] - df['Sales'].shift(1)
# Note: this is the difference at lag 2, which is what hooman uses below;
# the textbook second-order difference would be df['Sales'].diff().diff().
df['Sales Second Difference'] = df['Sales'] - df['Sales'].shift(2)
df.head()

[Image: output of df.head() with the difference columns]

Hooman now checks if the first difference is stationary.

adfuller_test(df['Sales First Difference'].dropna())

Now the p-value becomes 0.054213290283824704, which is barely above the 0.05 cut-off, implying that our data is not stationary yet. If you take the second difference,

adfuller_test(df['Sales Second Difference'].dropna())

The p-value now becomes 0.03862975767698862, which indicates that the data is stationary. So do you need to take the second difference to stationarize the data?

Well, actually no. Now you are confused. Hooman asks you to have a closer look at the p-values of your first and second difference ADF results. In the case of the first difference, the p-value is almost 0.05, and the gap between 0.054213290283824704 and 0.05 is negligible. So you can say that the first difference has made your data almost stationary. If you take the second difference in such cases, where the first difference already makes the data almost stationary, there is a chance that your data becomes over-differenced, which will later not give you a good forecast. Hooman says that you have to remember this term as the ‘order of differencing’, or simply ‘d’. If you had to take the 99th difference to get a stationary time series, your d value would have been 99. In this case, your d value is simply 1.

If you have any confusion, you can cross-check with some other stationarity test, such as KPSS.
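Here is a minimal sketch of that KPSS cross-check. Careful: in KPSS the hypotheses are flipped compared to ADF, the null hypothesis now says the data IS stationary, so a p-value below 0.05 speaks against stationarity. The kpss function is standard statsmodels; the little wrapper around it is just for illustration:

from statsmodels.tsa.stattools import kpss

def kpss_test(series):
    # In KPSS, H0 says the series IS stationary (the opposite of ADF).
    statistic, p_value, n_lags, critical_values = kpss(series.dropna(), regression='c', nlags='auto')
    print('KPSS Statistic : ' + str(statistic))
    print('p-value : ' + str(p_value))
    if p_value <= 0.05:
        print("Evidence against H0: data looks non-stationary by KPSS")
    else:
        print("No evidence against H0: data looks stationary by KPSS")

kpss_test(df['Sales First Difference'])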

There is another method you can try to confirm your d value. You plot the correlation between a time series and a lagged version of itself and observe the output. Hooman calls such a correlation an ‘AUTOCORRELATION’. It usually decays, so if the output curve approaches 0 quickly, we can assume that the differenced series is stationary.

from pandas.plotting import autocorrelation_plot
import matplotlib.pyplot as plt
autocorrelation_plot(df['Sales First Difference'].dropna())
plt.show()

[Image: autocorrelation plot of the first difference]

Hooman will be forecasting this data using a model named ‘ARIMA’, short for AutoRegressive Integrated Moving Average. You need to know 3 values to tune this model, and you already know one of them: the d value.

You see, there are three parts in the ARIMA model. The AR or autoregressive part is denoted by p, the integrated part is denoted by d (which you have already calculated), and the MA or moving average part is denoted by q. You already know what d does, and now you are curious about what the AR and MA terms do. Hooman starts explaining.

The first term is ‘AUTOREGRESSIVE’, which has two parts: auto, and regressive. You already know about regression from your previous conversations with hooman. Hooman says auto means something like ‘with yourself’. So you can say autoregression is the event where the lagged values of your data have an impact on your current values. AR(x) means x lagged terms have an impact on your current data. On the other hand, the ‘MOVING AVERAGE’ term handles the randomness in your data: instead of past values, it looks at past forecast errors, the random shocks. MA(x) means you are taking x previous forecast errors into account to understand your current data.
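If kitty wants some intuition for these two terms, one option is to simulate a toy AR(1) and a toy MA(1) series and stare at them. A minimal sketch; the coefficient 0.6 is a made-up illustration value, not anything estimated from your sales data:

import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.arima_process import ArmaProcess

np.random.seed(42)  # so the toy series are reproducible

# ArmaProcess takes lag polynomials: [1, -phi] for AR, [1, theta] for MA.
ar1 = ArmaProcess(ar=np.array([1, -0.6]), ma=np.array([1]))  # AR(1) with phi = 0.6
ma1 = ArmaProcess(ar=np.array([1]), ma=np.array([1, 0.6]))   # MA(1) with theta = 0.6

fig, axes = plt.subplots(1, 2, figsize=(9, 3))
axes[0].plot(ar1.generate_sample(nsample=100)); axes[0].set_title('AR(1)')
axes[1].plot(ma1.generate_sample(nsample=100)); axes[1].set_title('MA(1)')
plt.show()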

Now how do you calculate these AR (or p) and MA (or q) terms? Hooman explains the AR term calculation first. You need to plot a ‘Partial Autocorrelation Function’, or ‘PACF’, to find the AR terms. PACF? What’s that? Hooman says that in a PACF, you take each lagged value of your series, remove the effects that are already explained by the shorter lags, and find the ‘partial’ correlation that remains.

from statsmodels.graphics.tsaplots import plot_pacf

plt.rcParams.update({'figure.figsize': (9, 3), 'figure.dpi': 120})

fig, axes = plt.subplots(1, 2, sharex=False)  # the lag axis should not share the time axis
axes[0].plot(df['Sales First Difference']); axes[0].set_title('1st Differencing')
axes[1].set(ylim=(-1.1, 1.1))  # PACF values always lie between -1 and 1
plot_pacf(df['Sales First Difference'].dropna(), ax=axes[1])

plt.show()

The output looks like this:

[Image: the first difference and its PACF]

Hooman is not sure how many bars cross the blue zone after lag 0. It could be 1 or 2. So our AR (p) value would be either 1 or 2.

To find the MA terms, you check an ‘Autocorrelation Function’, or ACF. An ACF gives you the complete autocorrelation of a time series with its lagged values.

from statsmodels.graphics.tsaplots import plot_acf

fig, axes = plt.subplots(1, 2, sharex=False)
axes[0].plot(df['Sales First Difference']); axes[0].set_title('1st Differencing')
axes[1].set(ylim=(-1.1, 1.2))  # ACF is 1 at lag 0 and stays within [-1, 1] after that
plot_acf(df['Sales First Difference'].dropna(), ax=axes[1])

plt.show()

[Image: the first difference and its ACF]

Hooman similarly reads the MA (q) value off this graph. It could be 3 or 4.
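If eyeballing bars feels shaky, you can also print the raw PACF and ACF values and compare them with the rough rule-of-thumb significance bound of ±1.96/√n (an approximation; the blue zones in the plots are computed a little differently):

import numpy as np
from statsmodels.tsa.stattools import acf, pacf

diff = df['Sales First Difference'].dropna()
bound = 1.96 / np.sqrt(len(diff))  # approximate 95% significance band

print('significance bound ~', round(bound, 3))
print('PACF:', np.round(pacf(diff, nlags=6), 3))  # lags with |value| > bound hint at p
print('ACF :', np.round(acf(diff, nlags=6), 3))   # lags with |value| > bound hint at q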

So your (p, d, q) combination could be (1, 1, 3), (2, 1, 3), (1, 1, 4) or (2, 1, 4). Which one is it?

Take the first combination, use it to fit an ARIMA model, and then print the model summary like this:

from statsmodels.tsa.arima_model import ARIMA  # old API; removed in statsmodels 0.13

# Create training and test sets (the first 130 rows train the model)
train = df['Sales'][:130]
test = df['Sales'][130:]

# Build the model with the first candidate order
model = ARIMA(train, order=(1, 1, 3))
fitted = model.fit(disp=-1)  # disp=-1 silences the optimizer's convergence output
print(fitted.summary())

There will be an ‘AIC’ term in the output. AIC means Akaike Information Criterion, which scores how well the model fits while penalizing extra parameters. The lower the value, the better the model.

[Image: summary of the ARIMA(1, 1, 3) model]

Now hooman compares the AIC of all 4 combinations.

[Image: AIC values of the four candidate models]
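Hooman compared those four by hand, but a small loop could do the same job. A minimal sketch, with one caveat: statsmodels removed the old statsmodels.tsa.arima_model module in version 0.13, so this sketch uses the newer statsmodels.tsa.arima.model.ARIMA, whose AIC values may differ slightly from the numbers above:

from statsmodels.tsa.arima.model import ARIMA  # newer API (statsmodels >= 0.13)

candidates = [(1, 1, 3), (2, 1, 3), (1, 1, 4), (2, 1, 4)]
for order in candidates:
    fitted = ARIMA(train, order=order).fit()
    print(order, 'AIC:', round(fitted.aic, 2))  # lower AIC wins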

You see, the model with order (2, 1, 4) gives the best (lowest) AIC. Hooman now fits that model and plots the forecast.

from statsmodels.tsa.arima_model import ARIMA  # old API; removed in statsmodels 0.13

# Create training and test sets
train = df['Sales'][:130]
test = df['Sales'][130:]

# Build the model with the winning order
model = ARIMA(train, order=(2, 1, 4))
fitted = model.fit(disp=-1)

# Forecast 14 steps ahead with a 95% confidence interval
fc, se, conf = fitted.forecast(14, alpha=0.05)

# Wrap the results as pandas Series aligned with the test index
fc_series = pd.Series(fc, index=test.index)
lower_series = pd.Series(conf[:, 0], index=test.index)
upper_series = pd.Series(conf[:, 1], index=test.index)

# Plot
plt.figure(figsize=(12, 5), dpi=100)
plt.plot(train, label='training')
plt.plot(test, label='actual')
plt.plot(fc_series, label='forecast')
plt.fill_between(lower_series.index, lower_series, upper_series,
                 color='k', alpha=.15)
plt.title('Forecast vs Actuals')
plt.legend(loc='upper left', fontsize=8)
plt.show()
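One hedged note: the statsmodels.tsa.arima_model module used above was removed in statsmodels 0.13, so on a recent install the block above raises an ImportError. A rough equivalent with the current API might look like this, where get_forecast replaces the old tuple-returning forecast:

from statsmodels.tsa.arima.model import ARIMA  # current API

model = ARIMA(train, order=(2, 1, 4))
fitted = model.fit()

# get_forecast returns a results object instead of the old (fc, se, conf) tuple
forecast = fitted.get_forecast(steps=14)
fc_series = forecast.predicted_mean
conf = forecast.conf_int(alpha=0.05)  # lower and upper bounds as two columns

plt.figure(figsize=(12, 5), dpi=100)
plt.plot(train, label='training')
plt.plot(test, label='actual')
plt.plot(fc_series, label='forecast')
plt.fill_between(fc_series.index, conf.iloc[:, 0], conf.iloc[:, 1],
                 color='k', alpha=.15)
plt.title('Forecast vs Actuals')
plt.legend(loc='upper left', fontsize=8)
plt.show()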

The more data you have, the better the forecast will be.

[Image: forecast vs actuals plot]

Now you have your sales forecast!

Top comments (2)

Juan José Baeza López:
Hello Marjan Ferdousi, first of all, thank you very much for sharing your knowledge with the community. Can you indicate where I can download the datasets to practice the code? Thank you very much, your effort is appreciated!

Marjan Ferdousi:
Hello! Thank you for making me realize that I totally forgot to add the dataset source. I usually download practice datasets from Kaggle, or from this link: archive.ics.uci.edu/ml/datasets.php I hope that's helpful for you.