Periodic event forecasting is quite useful if you are, for example, a data aggregator. Data aggregators, or data providers, are organizations that collect statistical, financial, or other data from different sources, transform it, and then offer it for further analysis and exploration (data as a service).
It is really important for such organizations to monitor release dates in order to gather data as soon as it is published and to plan capacity for handling the incoming volumes of data.
Sometimes the authorities that publish data have a schedule of future releases, sometimes not. In some cases, they announce the schedule only for the next one or two months, so you may want to build the publication schedule yourself and predict release dates.
For the majority of statistical releases, you can find a pattern based on the day of the week or month. For example, statistics can be released
- every last working day of the month,
- every third Tuesday of the month,
- every second-to-last working day of the month, etc.
With this in mind, plus the history of past release dates, we want to predict the potential date, or range of dates, when the next data release might happen.
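As a quick illustration, such calendar rules are easy to express in code. Here is a minimal sketch (the helper name is mine, purely for illustration):

import pandas as pd

def is_third_tuesday(d):
    # Tuesday is weekday 1; the weekday's occurrence in the month is (day + 6) // 7
    return d.weekday() == 1 and (d.day + 6) // 7 == 3

print(is_third_tuesday(pd.Timestamp('2021-01-19')))  # True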
Case Study
As a case study, let’s take the U.S. Conference Board (CB) Consumer Confidence Indicator. It is a leading indicator that measures the level of consumer confidence in economic activity. Using it, we can predict consumer spending, which plays a major role in overall economic activity.
The official data provider does not publish a schedule for this series, but many data aggregators like Investing.com have been collecting the data for a while, and the series’ release history is available there.
Goal: we need to predict the date(s) of the next release(s).
Data preparation
We start by importing all the packages for data manipulation, machine learning models, and other data transformations.
# Data manipulation
import pandas as pd
import numpy as np

# Manipulation with dates
from datetime import date
from dateutil.relativedelta import relativedelta

# Machine learning
import xgboost as xgb
from sklearn import metrics
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
The next step is to get the list of historical release dates. You may have a database with all the data and its release history that you can use. To keep this simple and focus on release date prediction, I will add the history to a DataFrame manually.
data = pd.DataFrame({'Date': ['2021-01-26','2020-12-22',
'2020-11-24','2020-10-27','2020-09-29',
'2020-08-25','2020-07-28','2020-06-30',
'2020-05-26','2020-04-28','2020-03-31',
'2020-02-25','2020-01-28','2019-12-31',
'2019-11-26','2019-10-29','2019-09-24',
'2019-08-27','2019-07-30','2019-06-25',
'2019-05-28']})
We should also add a column with 0 and 1 values specifying whether a release happened on a given date. For now, we only have the dates of releases, so we create a column filled with 1s.
data['Date'] = pd.to_datetime(data['Date'])
data['Release'] = 1
After that, we need to create rows for all the dates between releases in the DataFrame and fill the Release column with zeros for them.
r = pd.date_range(start=data['Date'].min(), end=data['Date'].max())
data = (data.set_index('Date')
            .reindex(r)
            .fillna(0.0)
            .rename_axis('Date')
            .reset_index())
Now the dataset is ready for further manipulations.
Feature engineering
Predicting the next release dates relies heavily on feature engineering, because we do not have any features besides the release date itself. Therefore, we will create the following features:
- month
- a calendar day of the month
- working day number
- day of the week
- week of month number
- monthly weekday occurrence (e.g., the second Wednesday of the month)
data['Month'] = data['Date'].dt.month
data['Day'] = data['Date'].dt.day
# Number of working days elapsed since the start of the month
data['Workday_N'] = np.busday_count(
    data['Date'].values.astype('datetime64[M]'),
    data['Date'].values.astype('datetime64[D]'))
data['Week_day'] = data['Date'].dt.weekday
# Calendar week of the month (weeks starting on Monday)
data['Week_of_month'] = (data['Date'].dt.day
                         - data['Date'].dt.weekday - 2) // 7 + 2
# Occurrence of the weekday within the month (1st, 2nd, ... Tuesday)
data['Weekday_order'] = (data['Date'].dt.day + 6) // 7
data = data.set_index('Date')
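As a quick, illustrative sanity check, we can inspect the features for a known release date; for example, 2021-01-26 fell on the fourth Tuesday of January:

# Expect Week_day = 1 (Tuesday) and Weekday_order = 4 (fourth Tuesday)
print(data.loc['2021-01-26'])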
Training machine learning models
First of all, we need to split our dataset into two parts: train and test. Don’t forget to set the shuffle argument to False, because our goal is to create a forecast based on past events.
x_train, x_test, y_train, y_test = train_test_split(
    data.drop(['Release'], axis=1), data['Release'],
    test_size=0.3, random_state=1, shuffle=False)
In general, shuffling helps to reduce overfitting by mixing up the training observations. But that is not our case: the model should always see the full, uninterrupted history of publication events.
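A one-line optional check confirms the split is indeed chronological:

# The newest training date must come before the oldest test date
print(x_train.index.max() < x_test.index.min())  # expect True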
In order to choose the best prediction model, we will test the following models:
- XGBoost
- K-nearest Neighbors (KNN)
- RandomForest
XGBoost
We will use XGBoost with tree base learners and the grid search method to choose the best parameters. Grid search tries every combination of the given parameter values and picks the best one based on cross-validation. A drawback of this approach is its long computation time.
Alternatively, random search can be used. It samples parameter values at random from the given ranges for a fixed number of iterations and then keeps the best model found.
However, when you have a large number of parameters, random search tests a relatively small share of the combinations, which makes finding a truly optimal combination almost impossible.
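For reference, a minimal random-search sketch could look like this (the parameter ranges and n_iter below are illustrative assumptions, not tuned values):

from sklearn.model_selection import RandomizedSearchCV

rand_param = {"learning_rate": [0.01, 0.05, 0.1],
              "n_estimators": [100, 150, 200],
              "alpha": [0.1, 0.5, 1],
              "max_depth": [2, 3, 4]}
random_search = RandomizedSearchCV(estimator=xgb.XGBRegressor(),
                                   param_distributions=rand_param,
                                   n_iter=10, cv=4,
                                   scoring="neg_mean_squared_error",
                                   random_state=1)
random_search.fit(x_train, y_train)
print("Best parameters found: ", random_search.best_params_)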
Returning to grid search: to use it, you need to specify the list of possible values for each parameter.
grid_param = {"learning_rate": [0.01, 0.1],
              "n_estimators": [100, 150, 200],
              "alpha": [0.1, 0.5, 1],
              "max_depth": [2, 3, 4]}

model = xgb.XGBRegressor()
grid_mse = GridSearchCV(estimator=model, param_grid=grid_param,
                        scoring="neg_mean_squared_error",
                        cv=4, verbose=1)
grid_mse.fit(x_train, y_train)

print("Best parameters found: ", grid_mse.best_params_)
print("Lowest RMSE found: ", np.sqrt(np.abs(grid_mse.best_score_)))
As you can see, the best parameters for our XGBoost model are: alpha = 0.5, n_estimators = 200, max_depth = 4, learning_rate = 0.1.
Let’s train the model with obtained parameters.
# 'binary:logistic' is the appropriate objective for a binary classifier
xgb_model = xgb.XGBClassifier(objective='binary:logistic',
                              colsample_bytree=1,
                              learning_rate=0.1,
                              max_depth=4,
                              alpha=0.5,
                              n_estimators=200)
xgb_model.fit(x_train, y_train)
xgb_prediction = xgb_model.predict(x_test)
K-nearest Neighbors (KNN)
The K-nearest neighbors model is meant to be used when you are trying to find similarities between observations. This is exactly our case, because we are trying to find patterns in past release dates.
The KNN algorithm has fewer parameters to tune, so it is simpler for those who have not used it before.
knn = KNeighborsClassifier(n_neighbors = 3, algorithm = 'auto',
weights = 'distance')
knn.fit(x_train, y_train)
knn_prediction = knn.predict(x_test)
Random Forest
Tuning the basic parameters of a random forest usually doesn’t take much time: you iterate over the possible numbers of estimators and maximum tree depths and choose the optimal values using the elbow method.
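For illustration, such an elbow search could look like the sketch below (the candidate values are my assumptions, not the original analysis):

# Try several forest sizes and watch where accuracy stops improving
for n in [10, 25, 50, 100, 200]:
    rf = RandomForestClassifier(n_estimators=n, max_depth=10, random_state=1)
    rf.fit(x_train, y_train)
    print(n, rf.score(x_test, y_test))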
random_forest = RandomForestClassifier(n_estimators=50,
max_depth=10, random_state=1)
random_forest.fit(x_train, y_train)
rf_prediction = random_forest.predict(x_test)
Comparing the results
We will use a confusion matrix to evaluate the performance of the trained models. It helps us compare the models side by side and understand whether the parameters should be tuned any further.
# Note: the arguments are (prediction, y_test), so rows correspond to
# predicted labels and columns to actual ones; the TN/FN/FP/TP labels
# below follow that ordering.
xgb_matrix = metrics.confusion_matrix(xgb_prediction, y_test)
print(f"""
Confusion matrix for XGBoost model:
TN:{xgb_matrix[0][0]} FN:{xgb_matrix[0][1]}
FP:{xgb_matrix[1][0]} TP:{xgb_matrix[1][1]}""")

knn_matrix = metrics.confusion_matrix(knn_prediction, y_test)
print(f"""
Confusion matrix for KNN model:
TN:{knn_matrix[0][0]} FN:{knn_matrix[0][1]}
FP:{knn_matrix[1][0]} TP:{knn_matrix[1][1]}""")

rf_matrix = metrics.confusion_matrix(rf_prediction, y_test)
print(f"""
Confusion matrix for Random Forest model:
TN:{rf_matrix[0][0]} FN:{rf_matrix[0][1]}
FP:{rf_matrix[1][0]} TP:{rf_matrix[1][1]}""")
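If you prefer single numbers for a side-by-side comparison, precision and recall can also be derived from the same predictions (an optional, illustrative addition):

for name, pred in [('XGBoost', xgb_prediction),
                   ('KNN', knn_prediction),
                   ('Random Forest', rf_prediction)]:
    print(name,
          'precision:', metrics.precision_score(y_test, pred),
          'recall:', metrics.recall_score(y_test, pred))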
As you can see, both XGBoost and RandomForest show good performance. Both were able to catch the pattern and predict dates correctly in most cases. However, both models made a mistake with the December 2020 release, because it breaks the usual release pattern.
KNN is less accurate than the previous two: it failed to predict three dates correctly and missed five releases, so at this point we do not proceed with KNN. In general, it works better when the data is normalized, so you can try scaling the features and tuning it further, as in the sketch below.
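A minimal sketch of what that normalization could look like, assuming scikit-learn’s StandardScaler and Pipeline (results not verified here):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scale the features before the distance-based KNN step
knn_scaled = make_pipeline(StandardScaler(),
                           KNeighborsClassifier(n_neighbors=3,
                                                weights='distance'))
knn_scaled.fit(x_train, y_train)
knn_scaled_prediction = knn_scaled.predict(x_test)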
As for the remaining two, the XGBoost model is overcomplicated for our goal in terms of hyperparameter tuning, so RandomForest should be our choice.
Now we need to create a DataFrame with future dates and use the trained RandomForest model to predict releases for one year ahead.
x_predict = pd.DataFrame(pd.date_range(date.today(),
                         date.today() + relativedelta(years=1),
                         freq='d'), columns=['Date'])
x_predict['Month'] = x_predict['Date'].dt.month
x_predict['Day'] = x_predict['Date'].dt.day
x_predict['Workday_N'] = np.busday_count(
    x_predict['Date'].values.astype('datetime64[M]'),
    x_predict['Date'].values.astype('datetime64[D]'))
x_predict['Week_day'] = x_predict['Date'].dt.weekday
x_predict['Week_of_month'] = (x_predict['Date'].dt.day
                              - x_predict['Date'].dt.weekday - 2) // 7 + 2
x_predict['Weekday_order'] = (x_predict['Date'].dt.day + 6) // 7
x_predict = x_predict.set_index('Date')

# Predict with the chosen RandomForest model
prediction = random_forest.predict(x_predict)
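To turn the 0/1 predictions back into concrete dates, we can filter the index by the predicted labels (a small illustrative step):

# Keep only the dates the model flags as releases
predicted_dates = x_predict.index[prediction == 1]
print(predicted_dates)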
That’s it: we created a forecast of release dates for the U.S. CB Consumer Confidence series for one year ahead.
Conclusion
If you want to predict future dates for periodic events, you should think about meaningful features to create. They should capture all the patterns you can find in the history. As you can see, we did not spend much time on model tuning: even simple models can give good results if you use the right features.
Thank you for reading to the end. I hope it was helpful; please let me know in the comments if you spot any mistakes.