DEV Community

Edward Amor

Posted on • Originally published at edwardamor.xyz on

Simple Time Series Forecasting with ML

Time series forecasting is an interesting sub-topic within machine learning, mainly because the time component adds complexity to making predictions. Over the past month I’ve grown quite fond of it, and one of the best things I’ve learned is that standard supervised machine learning algorithms can be applied to time series. The process is much like a standard ML workflow, except that you have to structure your data in a specific way to preserve its temporal order.

Environment Setup

For setting up your environment I recommend anaconda; it’s more or less the de facto environment manager for data science. However, if you only have python on your system, that is more than enough. I’m also assuming you have a terminal available with a unix-like shell such as bash or git bash.

$ mkdir tsml-tutorial
$ cd tsml-tutorial

If you have anaconda available on your system:

$ conda create -n tsml jupyter pandas scipy numpy matplotlib seaborn scikit-learn statsmodels
$ conda activate tsml

If you don’t have anaconda available on your system, but have python 3.3+ installed:

$ python -m venv venv
$ source venv/bin/activate
$ pip install jupyter pandas scipy numpy matplotlib seaborn scikit-learn statsmodels

Now that your environment is set up, you can follow along by starting a local jupyter server and opening a fresh notebook.

$ jupyter notebook

Data Extraction

For this little tutorial we’ll use one of the most common univariate time series datasets, one you’ve probably already seen: Daily minimum temperatures in Melbourne, Australia, 1981-1990. As you may have guessed, the data consists of the daily minimum temperature in Melbourne, Australia, over the course of 10 years. We’ll grab the data with pandas from a github repository. You can find it at the following url: https://github.com/jbrownlee/Datasets/blob/master/daily-min-temperatures.csv.

# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score, TimeSeriesSplit
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.compose import ColumnTransformer

%matplotlib inline
sns.set()
# load our dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/daily-min-temperatures.csv"
df = pd.read_csv(url)
# output dataframe info
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3650 entries, 0 to 3649
Data columns (total 2 columns):
 # Column Non-Null Count Dtype  
--- ------ -------------- -----  
 0 Date 3650 non-null object 
 1 Temp 3650 non-null float64
dtypes: float64(1), object(1)
memory usage: 57.2+ KB

Our dataframe consists of 2 columns, Date and Temp, with no missing values, and 3650 observations (365 per year). Our data is typed as follows:

  • Date column as a string, which we’ll want to convert to a DatetimeIndex
  • Temp column as a float64.
# set Date as datetimeindex
df["Date"] = pd.to_datetime(df["Date"])
df = df.set_index("Date")
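
The same conversion can also be done while reading the file. The sketch below is a minimal illustration on a tiny inline CSV (the three rows are stand-in values in the dataset’s two-column layout, not a download), using the `parse_dates` and `index_col` arguments of `pd.read_csv`.

```python
import io

import pandas as pd

# Tiny inline sample in the same Date/Temp layout (illustrative values only)
csv_text = """Date,Temp
1981-01-01,20.7
1981-01-02,17.9
1981-01-03,18.8
"""

# parse_dates converts the column to datetimes; index_col makes it the index
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"], index_col="Date")
print(type(sample.index).__name__)
print(sample)
```

Either approach should leave you with a DatetimeIndex; the two-step version above just makes each step explicit.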


Data Exploration

Since this is a time series, we’d be remiss if we didn’t plot the data out fully. We’ll also want to inspect our data and see if there is autocorrelation.

# plot full 10 years
fig, ax = plt.subplots(figsize=(16, 9))
df.plot.line(title="Daily minimum temperatures in Melbourne, Australia, 1981-1990", style=".", ax=ax)
df.rolling(30).mean().plot(figsize=(16, 9), style="-", ax=ax)
df.rolling(30).std().plot(figsize=(16, 9), style="-", ax=ax)
plt.legend(["Temperature", "30-Day Rolling Average", "30-Day Rolling Std. Dev."])
plt.show()

Time Series Plot

Our plot of the temperature over the ten years shows that it oscillates, almost like a sinusoidal wave, and the rolling standard deviation shows that the variance doesn’t grow as time progresses. This would definitely be a good dataset for a SARIMA model, but that isn’t what we are here for.
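
The plot doesn’t measure autocorrelation directly. One rough way to check it is pandas’ `Series.autocorr`, sketched here on a synthetic stand-in series (a yearly sine wave plus noise, which is an assumption and not the real data) so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the temperature series: yearly sine wave plus noise
rng = np.random.default_rng(0)
days = np.arange(3650)
synthetic = pd.Series(11 + 4 * np.sin(2 * np.pi * days / 365) + rng.normal(0, 1, days.size))

# Correlation of the series with a copy of itself shifted by `lag` days
for lag in (1, 7, 182, 365):
    print(f"lag {lag:>3}: {synthetic.autocorr(lag=lag):+.2f}")

lag_1 = synthetic.autocorr(lag=1)
lag_182 = synthetic.autocorr(lag=182)
lag_365 = synthetic.autocorr(lag=365)
```

On the real data the equivalent call would be `df["Temp"].autocorr(lag=1)`. For a full picture, statsmodels (already in the environment) provides `plot_acf` in `statsmodels.graphics.tsaplots`.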

Modeling

This is the crux of our tutorial: we’ll essentially be doing regression (albeit with a random forest regressor) to predict the temperature. To start, we’ll create features such as time lags and calendar features to build the temporal structure into our model. To keep things manageable as our data grows, we’ll also wrap the preprocessing and the model in a pipeline.

Features to create:

  • Time lags for the previous week
  • Rolling 30-Day Temperature average
  • Rolling 7-Day Temperature average
  • Month of the year
  • Week of the year
  • Day of the week
  • Next day’s temperature (the target we are predicting)
# create our features and new dataframe
data = pd.DataFrame({f"t-{x}": df.Temp.shift(x) for x in range(7, 0, -1)})
data["t"] = df.Temp
data["day"] = df.index.isocalendar().day
data["week_of_year"] = df.index.isocalendar().week
data["month"] = df.index.month
data["7-Day Temp. Avg."] = df.Temp.rolling(7).mean()
data["30-Day Temp. Avg."] = df.Temp.rolling(30).mean()
data["t+1"] = df.Temp.shift(-1)
data = data.dropna()

t is our current time step, and t+1 is the next day’s temperature, which we’ll be predicting. To make our model aware of time we’ve also created day, week and month features, and included lag values and rolling averages. Next we’ll divide our data into training and testing sets so we can train our model and validate it. However, since we are working with time series data, there is a strict order dependence, so we can’t shuffle our data before splitting; we have to maintain its order.
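
To see how `shift` lays out the lag columns and the target, here is a toy sketch on six made-up values (two lags instead of seven, purely for readability).

```python
import pandas as pd

# Six made-up temperatures, just to make the shifts easy to follow
temps = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0], name="Temp")

toy = pd.DataFrame({
    "t-2": temps.shift(2),   # value from two steps earlier
    "t-1": temps.shift(1),   # value from one step earlier
    "t": temps,              # current value
    "t+1": temps.shift(-1),  # next value: the target
})

# The first two rows lack lags and the last row lacks a target, so they are dropped
toy = toy.dropna()
print(toy)
```

Each remaining row holds only information available at time t, plus the value we want to predict for t+1.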

We’ll split our data up using a 70-30 split, where the last 30% of our data will be used as our testing data, and the first 70% is for our training.
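
As a rough sketch of what an ordered split looks like, the hypothetical helper below (`chronological_split` is not part of pandas or sklearn) cuts a frame at a fixed fraction without shuffling, shown on a made-up ten-day frame.

```python
import pandas as pd


def chronological_split(frame, train_fraction=0.7):
    """Split rows in order: the first `train_fraction` for training, the rest for testing."""
    cutoff = int(len(frame) * train_fraction)
    return frame.iloc[:cutoff], frame.iloc[cutoff:]


# Hypothetical ten-day frame, only to show the behaviour
toy = pd.DataFrame({"Temp": range(10)}, index=pd.date_range("1981-01-01", periods=10, freq="D"))
train, test = chronological_split(toy)

print(len(train), len(test))
# Every training date comes before every testing date
print(train.index.max() < test.index.min())
```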

# split data up into training and testing set, preprocess
num_cols = ['t-7', 't-6', 't-5', 't-4', 't-3', 't-2', 't-1', 't',
            '7-Day Temp. Avg.', '30-Day Temp. Avg.']
col_trans = ColumnTransformer(
    [
        # one-hot encode the calendar features (scikit-learn >= 1.2 renames `sparse` to `sparse_output`)
        ("categorical_cols", OneHotEncoder(drop="first", sparse=False), ["week_of_year", "month", "day"]),
        ("numeric_cols", StandardScaler(), num_cols)
    ]
)

pipe = Pipeline([("trans", col_trans), ("regression", RandomForestRegressor(n_jobs=-1))])

X = data.drop(columns="t+1")
y = data["t+1"]

X_train, X_test = X[:int(X.shape[0] * .7)], X[int(X.shape[0] * .7):]
y_train, y_test = y[:int(y.shape[0] * .7)], y[int(y.shape[0] * .7):]

Since we can’t use ordinary k-fold cross validation, which would train on observations from the future, we’ll use the TimeSeriesSplit class from sklearn. It is essentially the time series counterpart of k-fold validation: each fold trains on earlier observations and validates on the ones that follow. The alternative would be to train our model on all the data and compare models using an information criterion. Realistically, though, you should use multiple metrics when doing any model selection.
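
To make the fold layout concrete, this small sketch prints the indices `TimeSeriesSplit` produces for twelve ordered stand-in observations (the sample size is arbitrary and chosen only so the output stays short).

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve ordered observations stand in for the training rows
samples = np.arange(12)

folds = list(TimeSeriesSplit(n_splits=5).split(samples))
for number, (train_idx, test_idx) in enumerate(folds, start=1):
    # Each fold trains on an expanding window and validates on the block that follows it
    print(f"fold {number}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

The training window grows with each fold, and the validation block always lies after it, so the model is never scored on data that precedes its training set.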

# perform cross validation on training data
-cross_val_score(pipe, X_train, y_train, cv=TimeSeriesSplit(), scoring="neg_root_mean_squared_error").mean()
2.5270646279116358

Here we have the RMSE after cross validation. It isn’t anything special, but it verifies that we can apply our standard ML toolset to a time series dataset. From our CV we can see that our model’s predictions are off by roughly 2.5 degrees (RMSE).

# fit our model and make predictions on testing data
pipe.fit(X_train, y_train)
preds = pipe.predict(X_test)
# show the predictions
y_test.plot(figsize=(16, 9), title="Predictions on Hold Out Data")
pd.Series(preds, index=y_test.index).plot()
plt.legend(["Observations", "Predictions"])
plt.show()

Model Predictions

# output RMSE score on test data (square root of the MSE; works across scikit-learn versions)
np.sqrt(mean_squared_error(y_test, preds))
2.3153790785018127
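
To put an RMSE like this in context, it can help to compare it against a naive persistence forecast, where tomorrow’s temperature is predicted to equal today’s. This baseline is not part of the workflow above. The helper `persistence_rmse` is a hypothetical name, and the values are made up only to show the call.

```python
import numpy as np
import pandas as pd


def persistence_rmse(current, target):
    """RMSE of the naive forecast 'tomorrow equals today'."""
    current = np.asarray(current, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.sqrt(np.mean((target - current) ** 2)))


# Hypothetical values, only to show the call
today = pd.Series([10.0, 12.0, 11.0, 13.0])
tomorrow = pd.Series([12.0, 11.0, 13.0, 12.0])
baseline = persistence_rmse(today, tomorrow)
print(round(baseline, 3))
```

With the variables from this tutorial, the call would be `persistence_rmse(X_test["t"], y_test)`. A model is only adding value if its RMSE comes in below that number.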

Looking at the predictions made by our model, we aren’t going to be telling anyone the weather anytime soon. However, this is a prime example of how to apply standard Machine Learning algorithms to your time series.
