Time Series Forecasting
Original Data Link: Appliances Energy Prediction Dataset
Target:
- Use suitable time series analysis techniques.
- Apply at least two time-series forecasting methods.
- Compare the results from all candidate models.
- Choose the best model.
- Justify your choice and discuss the results.
1. Read the dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error  # regression metrics (classification metrics are not needed for this task)
import warnings
warnings.filterwarnings('ignore')
# read dataset from csv file
data = pd.read_csv('energydata_complete.csv')
# load only the target column 'Appliances' (column index 1) separately, for later use with the LSTM model
data_target = pd.read_csv('energydata_complete.csv', usecols=[1], engine='python')
# set standard color variable
color = "#1f77b4"
# display the first few rows to confirm the data imported correctly
data.head()
date | Appliances | lights | T1 | RH_1 | T2 | RH_2 | T3 | RH_3 | T4 | ... | T9 | RH_9 | T_out | Press_mm_hg | RH_out | Windspeed | Visibility | Tdewpoint | rv1 | rv2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-01-11 17:00:00 | 60 | 30 | 19.89 | 47.596667 | 19.2 | 44.790000 | 19.79 | 44.730000 | 19.000000 | ... | 17.033333 | 45.53 | 6.600000 | 733.5 | 92.0 | 7.000000 | 63.000000 | 5.3 | 13.275433 | 13.275433 |
1 | 2016-01-11 17:10:00 | 60 | 30 | 19.89 | 46.693333 | 19.2 | 44.722500 | 19.79 | 44.790000 | 19.000000 | ... | 17.066667 | 45.56 | 6.483333 | 733.6 | 92.0 | 6.666667 | 59.166667 | 5.2 | 18.606195 | 18.606195 |
2 | 2016-01-11 17:20:00 | 50 | 30 | 19.89 | 46.300000 | 19.2 | 44.626667 | 19.79 | 44.933333 | 18.926667 | ... | 17.000000 | 45.50 | 6.366667 | 733.7 | 92.0 | 6.333333 | 55.333333 | 5.1 | 28.642668 | 28.642668 |
3 | 2016-01-11 17:30:00 | 50 | 40 | 19.89 | 46.066667 | 19.2 | 44.590000 | 19.79 | 45.000000 | 18.890000 | ... | 17.000000 | 45.40 | 6.250000 | 733.8 | 92.0 | 6.000000 | 51.500000 | 5.0 | 45.410389 | 45.410389 |
4 | 2016-01-11 17:40:00 | 60 | 40 | 19.89 | 46.333333 | 19.2 | 44.530000 | 19.79 | 45.000000 | 18.890000 | ... | 17.000000 | 45.40 | 6.133333 | 733.9 | 92.0 | 5.666667 | 47.666667 | 4.9 | 10.084097 | 10.084097 |
5 rows × 29 columns
Variables Description:
T1: Temperature in kitchen area, in Celsius
T2: Temperature in living room area, in Celsius
T3: Temperature in laundry room area, in Celsius
T4: Temperature in office room, in Celsius
T5: Temperature in bathroom, in Celsius
T6: Temperature outside the building (north side), in Celsius
T7: Temperature in ironing room, in Celsius
T8: Temperature in teenager room 2, in Celsius
T9: Temperature in parents room, in Celsius
RH1: Humidity in kitchen area, in %
RH2: Humidity in living room area, in %
RH3: Humidity in laundry room area, in %
RH4: Humidity in office room, in %
RH5: Humidity in bathroom, in %
RH6: Humidity outside the building (north side), in %
RH7: Humidity in ironing room, in %
RH8: Humidity in teenager room 2, in %
RH9: Humidity in parents room, in %
To: Temperature outside (from Chievres weather station), in Celsius
Press_mm_hg: Pressure (from Chievres weather station), in mm Hg
RH_out: Humidity outside (from Chievres weather station), in %
Wind speed: (from Chievres weather station), in m/s
Visibility: (from Chievres weather station), in km
Tdewpoint: Dew point temperature (from Chievres weather station), in °C
Appliances: Energy use of appliances, in Wh (the dependent/target variable)
2. Analyse and visualise the data
2.1 Check data's basic information
# Check data scale
data.shape
(19735, 29)
# Check data types
data.dtypes
date object
Appliances int64
lights int64
T1 float64
RH_1 float64
T2 float64
RH_2 float64
T3 float64
RH_3 float64
T4 float64
RH_4 float64
T5 float64
RH_5 float64
T6 float64
RH_6 float64
T7 float64
RH_7 float64
T8 float64
RH_8 float64
T9 float64
RH_9 float64
T_out float64
Press_mm_hg float64
RH_out float64
Windspeed float64
Visibility float64
Tdewpoint float64
rv1 float64
rv2 float64
dtype: object
2.2 Handle missing values
# Check for missing values
data.isnull().sum()
date 0
Appliances 0
lights 0
T1 0
RH_1 0
T2 0
RH_2 0
T3 0
RH_3 0
T4 0
RH_4 0
T5 0
RH_5 0
T6 0
RH_6 0
T7 0
RH_7 0
T8 0
RH_8 0
T9 0
RH_9 0
T_out 0
Press_mm_hg 0
RH_out 0
Windspeed 0
Visibility 0
Tdewpoint 0
rv1 0
rv2 0
dtype: int64
All counts returned by isnull().sum() are 0, so the dataset contains no missing values and no imputation is needed.
# Check if we have any duplicated data
any(data.duplicated())
False
The result 'False' confirms that there are no duplicate rows.
# Convert 'date' column to datetime format so that it can be used with time series functionality
data['date'] = pd.to_datetime(data['date'])
2.3 Exploratory Data Analysis (EDA)
- Add Features
Since the target variable in the dataset is the electricity consumption of household appliances ('Appliances'), it may be influenced by other features such as temperature. However, intuitively, electricity consumption is directly linked to time. For example, electricity usage in the evening is expected to be higher than during the day because lights are commonly used, and there may be more household activities like watching TV. Conversely, from midnight to early morning, the target variable should be at its lowest as most people are asleep, resulting in minimal electricity usage.
Therefore, to better observe the distribution of the target variable over time, we have introduced the following variables.
# Copy dataset for time-based analysis
new_data = data.copy()
new_data
date | Appliances | lights | T1 | RH_1 | T2 | RH_2 | T3 | RH_3 | T4 | ... | T9 | RH_9 | T_out | Press_mm_hg | RH_out | Windspeed | Visibility | Tdewpoint | rv1 | rv2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-01-11 17:00:00 | 60 | 30 | 19.890000 | 47.596667 | 19.200000 | 44.790000 | 19.790000 | 44.730000 | 19.000000 | ... | 17.033333 | 45.5300 | 6.600000 | 733.5 | 92.000000 | 7.000000 | 63.000000 | 5.300000 | 13.275433 | 13.275433 |
1 | 2016-01-11 17:10:00 | 60 | 30 | 19.890000 | 46.693333 | 19.200000 | 44.722500 | 19.790000 | 44.790000 | 19.000000 | ... | 17.066667 | 45.5600 | 6.483333 | 733.6 | 92.000000 | 6.666667 | 59.166667 | 5.200000 | 18.606195 | 18.606195 |
2 | 2016-01-11 17:20:00 | 50 | 30 | 19.890000 | 46.300000 | 19.200000 | 44.626667 | 19.790000 | 44.933333 | 18.926667 | ... | 17.000000 | 45.5000 | 6.366667 | 733.7 | 92.000000 | 6.333333 | 55.333333 | 5.100000 | 28.642668 | 28.642668 |
3 | 2016-01-11 17:30:00 | 50 | 40 | 19.890000 | 46.066667 | 19.200000 | 44.590000 | 19.790000 | 45.000000 | 18.890000 | ... | 17.000000 | 45.4000 | 6.250000 | 733.8 | 92.000000 | 6.000000 | 51.500000 | 5.000000 | 45.410389 | 45.410389 |
4 | 2016-01-11 17:40:00 | 60 | 40 | 19.890000 | 46.333333 | 19.200000 | 44.530000 | 19.790000 | 45.000000 | 18.890000 | ... | 17.000000 | 45.4000 | 6.133333 | 733.9 | 92.000000 | 5.666667 | 47.666667 | 4.900000 | 10.084097 | 10.084097 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
19730 | 2016-05-27 17:20:00 | 100 | 0 | 25.566667 | 46.560000 | 25.890000 | 42.025714 | 27.200000 | 41.163333 | 24.700000 | ... | 23.200000 | 46.7900 | 22.733333 | 755.2 | 55.666667 | 3.333333 | 23.666667 | 13.333333 | 43.096812 | 43.096812 |
19731 | 2016-05-27 17:30:00 | 90 | 0 | 25.500000 | 46.500000 | 25.754000 | 42.080000 | 27.133333 | 41.223333 | 24.700000 | ... | 23.200000 | 46.7900 | 22.600000 | 755.2 | 56.000000 | 3.500000 | 24.500000 | 13.300000 | 49.282940 | 49.282940 |
19732 | 2016-05-27 17:40:00 | 270 | 10 | 25.500000 | 46.596667 | 25.628571 | 42.768571 | 27.050000 | 41.690000 | 24.700000 | ... | 23.200000 | 46.7900 | 22.466667 | 755.2 | 56.333333 | 3.666667 | 25.333333 | 13.266667 | 29.199117 | 29.199117 |
19733 | 2016-05-27 17:50:00 | 420 | 10 | 25.500000 | 46.990000 | 25.414000 | 43.036000 | 26.890000 | 41.290000 | 24.700000 | ... | 23.200000 | 46.8175 | 22.333333 | 755.2 | 56.666667 | 3.833333 | 26.166667 | 13.233333 | 6.322784 | 6.322784 |
19734 | 2016-05-27 18:00:00 | 430 | 10 | 25.500000 | 46.600000 | 25.264286 | 42.971429 | 26.823333 | 41.156667 | 24.700000 | ... | 23.200000 | 46.8450 | 22.200000 | 755.2 | 57.000000 | 4.000000 | 27.000000 | 13.200000 | 34.118851 | 34.118851 |
19735 rows × 29 columns
# Converting date into datetime
new_data['date'] = new_data['date'].astype('datetime64[ns]')
new_data['Date'] = pd.to_datetime(new_data['date']).dt.date
new_data['Time'] = pd.to_datetime(new_data['date']).dt.time
new_data['hour'] = new_data['date'].dt.hour
new_data['month'] = new_data['date'].dt.month
new_data['day_of_week'] = new_data['date'].dt.dayofweek
new_data= new_data.drop(["date"], axis=1)
new_data
Appliances | lights | T1 | RH_1 | T2 | RH_2 | T3 | RH_3 | T4 | RH_4 | ... | Windspeed | Visibility | Tdewpoint | rv1 | rv2 | Date | Time | hour | month | day_of_week | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 60 | 30 | 19.890000 | 47.596667 | 19.200000 | 44.790000 | 19.790000 | 44.730000 | 19.000000 | 45.566667 | ... | 7.000000 | 63.000000 | 5.300000 | 13.275433 | 13.275433 | 2016-01-11 | 17:00:00 | 17 | 1 | 0 |
1 | 60 | 30 | 19.890000 | 46.693333 | 19.200000 | 44.722500 | 19.790000 | 44.790000 | 19.000000 | 45.992500 | ... | 6.666667 | 59.166667 | 5.200000 | 18.606195 | 18.606195 | 2016-01-11 | 17:10:00 | 17 | 1 | 0 |
2 | 50 | 30 | 19.890000 | 46.300000 | 19.200000 | 44.626667 | 19.790000 | 44.933333 | 18.926667 | 45.890000 | ... | 6.333333 | 55.333333 | 5.100000 | 28.642668 | 28.642668 | 2016-01-11 | 17:20:00 | 17 | 1 | 0 |
3 | 50 | 40 | 19.890000 | 46.066667 | 19.200000 | 44.590000 | 19.790000 | 45.000000 | 18.890000 | 45.723333 | ... | 6.000000 | 51.500000 | 5.000000 | 45.410389 | 45.410389 | 2016-01-11 | 17:30:00 | 17 | 1 | 0 |
4 | 60 | 40 | 19.890000 | 46.333333 | 19.200000 | 44.530000 | 19.790000 | 45.000000 | 18.890000 | 45.530000 | ... | 5.666667 | 47.666667 | 4.900000 | 10.084097 | 10.084097 | 2016-01-11 | 17:40:00 | 17 | 1 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
19730 | 100 | 0 | 25.566667 | 46.560000 | 25.890000 | 42.025714 | 27.200000 | 41.163333 | 24.700000 | 45.590000 | ... | 3.333333 | 23.666667 | 13.333333 | 43.096812 | 43.096812 | 2016-05-27 | 17:20:00 | 17 | 5 | 4 |
19731 | 90 | 0 | 25.500000 | 46.500000 | 25.754000 | 42.080000 | 27.133333 | 41.223333 | 24.700000 | 45.590000 | ... | 3.500000 | 24.500000 | 13.300000 | 49.282940 | 49.282940 | 2016-05-27 | 17:30:00 | 17 | 5 | 4 |
19732 | 270 | 10 | 25.500000 | 46.596667 | 25.628571 | 42.768571 | 27.050000 | 41.690000 | 24.700000 | 45.730000 | ... | 3.666667 | 25.333333 | 13.266667 | 29.199117 | 29.199117 | 2016-05-27 | 17:40:00 | 17 | 5 | 4 |
19733 | 420 | 10 | 25.500000 | 46.990000 | 25.414000 | 43.036000 | 26.890000 | 41.290000 | 24.700000 | 45.790000 | ... | 3.833333 | 26.166667 | 13.233333 | 6.322784 | 6.322784 | 2016-05-27 | 17:50:00 | 17 | 5 | 4 |
19734 | 430 | 10 | 25.500000 | 46.600000 | 25.264286 | 42.971429 | 26.823333 | 41.156667 | 24.700000 | 45.963333 | ... | 4.000000 | 27.000000 | 13.200000 | 34.118851 | 34.118851 | 2016-05-27 | 18:00:00 | 18 | 5 | 4 |
19735 rows × 33 columns
- Hourly distribution of target variable
# Calculate the total energy consumed by the appliance per hour
app_hour = new_data.groupby(by='hour',as_index=False)['Appliances'].sum()
# Sort app_hour by descending order
app_hour.sort_values(by='Appliances',ascending=False)
hour | Appliances | |
---|---|---|
18 | 18 | 156670 |
17 | 17 | 133600 |
19 | 19 | 117600 |
11 | 11 | 109430 |
20 | 20 | 104380 |
10 | 10 | 103060 |
13 | 13 | 102540 |
12 | 12 | 101630 |
16 | 16 | 98560 |
9 | 9 | 92710 |
14 | 14 | 89010 |
8 | 8 | 87250 |
15 | 15 | 86990 |
21 | 21 | 79320 |
7 | 7 | 64650 |
22 | 22 | 56840 |
6 | 6 | 47440 |
23 | 23 | 46840 |
0 | 0 | 43390 |
5 | 5 | 43350 |
1 | 1 | 42190 |
4 | 4 | 40570 |
2 | 2 | 40340 |
3 | 3 | 39650 |
# Function to format the numerical value to 'k' format
def format_value(value):
if value < 1000:
return f"{value:.0f}"
elif value < 10000:
return f"{value/1000:.1f}k"
else:
return f"{value/1000:.0f}k"
# Plotting the data
plt.figure(figsize=(10, 4)) # Set the figure size
bars = plt.bar(app_hour['hour'], app_hour['Appliances'], color=color) # Create a bar plot
# Add numerical values on top of each bar
for bar in bars:
height = bar.get_height()
plt.text(bar.get_x() + bar.get_width() / 2, height, format_value(height), ha='center', va='bottom')
plt.xlabel('Hour') # Set the x-axis label
plt.ylabel('Energy Consumed by Appliances') # Set the y-axis label
plt.title('Energy Consumption by Hour') # Set the title
plt.xticks(app_hour['hour']) # Set the x-axis ticks to match the hours
plt.grid(axis='y', linestyle='--', alpha=0.7) # Add gridlines to the y-axis
plt.show() # Show the plot
The graph above confirms our hypothesis: electricity consumption is highest between 17:00 and 20:00 and lowest between 23:00 and 06:00 the following day. This matches our expectation, so it also serves as a sanity check on the reliability of the data.
- Weekday distribution of target variable
# Calculate the total energy consumed by appliances per day of the week
app_week_day = new_data.groupby(by='day_of_week',as_index=False)['Appliances'].sum()
# Sort app_week_day in descending order
app_week_day.sort_values(by='Appliances',ascending=False)
day_of_week | Appliances | |
---|---|---|
0 | 0 | 309610 |
4 | 4 | 297650 |
5 | 5 | 290690 |
3 | 3 | 260450 |
6 | 6 | 259690 |
2 | 2 | 259000 |
1 | 1 | 250920 |
day_names = {0: 'Monday', 1: 'Tuesday', 2: 'Wednesday', 3: 'Thursday', 4: 'Friday', 5: 'Saturday', 6: 'Sunday'}  # pandas dt.dayofweek: Monday=0 ... Sunday=6
# Plotting the data
plt.figure(figsize=(10, 4)) # Set the figure size
bars = plt.bar(app_week_day['day_of_week'], app_week_day['Appliances'], color=color) # Create a bar plot
# Add numerical values on top of each bar
for bar in bars:
height = bar.get_height()
plt.text(bar.get_x() + bar.get_width() / 2, height, format_value(height), ha='center', va='bottom')
plt.xlabel('Day of the week') # Set the x-axis label
plt.ylabel('Energy Consumed by Appliances') # Set the y-axis label
plt.title('Energy Consumption by day_of_week') # Set the title
plt.xticks(list(day_names.keys()), list(day_names.values()))
plt.grid(axis='y', linestyle='--', alpha=0.7) # Add gridlines to the y-axis
plt.show() # Show the plot
As with the hourly analysis within a day, there could also be a pattern over the span of a week. However, when split by day of the week, the distribution of the target variable shows no strong features. The only noteworthy point is that consumption on day 0 (Monday, under pandas' Monday=0 convention) is the highest, roughly 10-15% above the average of the other days.
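Because the recording window does not contain exactly the same number of each weekday, comparing per-day averages rather than totals can be a fairer check. Below is a minimal sketch reusing new_data from above; the grouping column and approach mirror the sum-based analysis.
# Mean consumption per weekday is less sensitive to unequal weekday counts than the sum
app_week_mean = new_data.groupby('day_of_week')['Appliances'].mean()
print(app_week_mean.round(1))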
- General view of the whole dataset
# Filter columns excluding 'rv1' and 'rv2'
selected_columns = [col for col in data.columns[2:] if col not in ['rv1', 'rv2']]
# Determine the number of rows needed for the subplots
num_cols = len(selected_columns)
num_rows = (num_cols - 1) // 5 + 1
# Create subplots
fig, axs = plt.subplots(num_rows, 5, figsize=(14, num_rows*4))
# Loop through selected columns and plot time series on subplots
for i in range(num_rows):
for j in range(5):
index = i * 5 + j
if index < num_cols:
var = selected_columns[index]
sns.lineplot(y=data[var], x=data['date'], ax=axs[i,j], linewidth=1.5)
axs[i,j].set_xlabel('Date')
axs[i,j].set_ylabel(var)
axs[i,j].set_title('{} Time Series Data'.format(var))
axs[i,j].tick_params(axis='x', rotation=90)
else:
axs[i,j].axis('off')
plt.tight_layout()
plt.show()
# Extracting numerical columns from the data
numerical = ['Appliances', 'lights', 'T1', 'RH_1', 'T2', 'RH_2', 'T3', 'RH_3',
'T4', 'RH_4', 'T5', 'RH_5', 'T6', 'RH_6', 'T7', 'RH_7', 'T8',
'RH_8', 'T9', 'RH_9', 'T_out', 'Press_mm_hg', 'RH_out',
'Windspeed', 'Visibility', 'Tdewpoint', 'rv1', 'rv2']
# Create subplots
fig, axs = plt.subplots(4, 7, figsize=(14, 10))
# Flatten the array of axes for easy iteration
axs = axs.flatten()
# Loop through each numerical column and plot its distribution
for i, col in enumerate(numerical):
ax = axs[i]
sns.boxplot(x=data['date'].dt.month, y=data[col], ax=ax)
ax.set_title(col)
ax.set_xlabel('Month')
ax.set_ylabel('Value')
# Adjust layout and show the plot
plt.tight_layout()
plt.show()
For temperature (T), there is an overall increasing trend in the data. This is in line with common sense, as the seasons transition from winter to spring in the Northern Hemisphere. Other data show no significant features.
# Creating subplots
fig, axs = plt.subplots(4, 7, figsize=(14, 10))
# Counter for accessing columns
cpt = 0
# Looping through subplots and plotting histograms for each numerical column
for i in range(4):
for j in range(7):
var = numerical[cpt]
axs[i,j].hist(data[var].values, rwidth=0.9)
axs[i,j].set_title(var)
cpt += 1
fig.suptitle('Independent Variable Distribution')
# Adjusting layout and displaying the plot
plt.tight_layout()
plt.show()
We can observe the shape of the distribution for each feature. For example, features like temperature and humidity may show a normal distribution, while others might be skewed or have multiple peaks.
We can also see that rv1 and rv2 are just random noise variables and can be ignored whenever features other than 'Appliances' are used as predictors.
- A closer look at Appliances
Appliances is our key study object, so we want a closer look at it.
# Plot the relationship between 'date' and 'Appliances'
plt.figure(figsize=(10, 6)) # Set the figure size
plt.plot(data['date'], data['Appliances'], color='#1f77b4', linestyle='-') # Line plot of the full series
plt.title('Relationship between Date and Appliances')
plt.xlabel('Date')
plt.ylabel('Appliances (Wh)')
plt.xticks(rotation=45) # Rotate x-axis labels for better readability
plt.grid(True) # Add gridlines
plt.tight_layout() # Adjust layout to prevent clipping of labels
plt.show()
From the plot above we can see that there is no visually strong pattern between electricity consumption ('Appliances') and date. However, the absence of an obvious pattern does not necessarily mean the data lacks underlying structure; rather, it underscores the need to uncover the latent patterns hidden within it.
# Calculating the simple moving averages (SMA) of the 'Appliances' column in the data DataFrame using different window sizes
window_size_1 = 100
window_size_2 = 200
window_size_3 = 500
sma_1 = data['Appliances'].rolling(window=window_size_1).mean() # get the mean in the time span of 100 past data entries
sma_2 = data['Appliances'].rolling(window=window_size_2).mean() # get the mean in the time span of 200 past data entries
sma_3 = data['Appliances'].rolling(window=window_size_3).mean() # get the mean in the time span of 500 past data entries
# Draw the plt
plt.figure(figsize=(10, 6))
plt.plot(data['date'], data['Appliances'], label='Original Data')
plt.plot(data['date'], sma_1, label='SMA (window size = {})'.format(window_size_1))
plt.plot(data['date'], sma_2, label='SMA (window size = {})'.format(window_size_2))
plt.plot(data['date'], sma_3, label='SMA (window size = {})'.format(window_size_3))
plt.xlabel('Date')
plt.ylabel('Appliances')
plt.title('Simple Moving Average')
plt.legend()
# Rotate x-axis labels by 45 degrees
plt.xticks(rotation=45)
plt.show()
After applying simple moving averages (SMA) with window sizes of 100, 200, and 500 and inspecting the resulting curves, no pronounced trend is discernible at first glance. This suggests that the 'Appliances' series remains relatively stable (roughly stationary in mean) over the data-capture period.
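To complement the SMA view, a rolling standard deviation alongside a rolling mean gives a rough visual check that both the level and the spread stay fairly constant. This is a sketch, assuming a one-day window of 144 ten-minute samples.
# Rolling mean and rolling standard deviation over a one-day window (144 samples of 10 minutes)
roll_mean = data['Appliances'].rolling(window=144).mean()
roll_std = data['Appliances'].rolling(window=144).std()
plt.figure(figsize=(10, 4))
plt.plot(data['date'], roll_mean, label='1-day rolling mean')
plt.plot(data['date'], roll_std, label='1-day rolling std')
plt.xlabel('Date')
plt.ylabel('Appliances (Wh)')
plt.title('Rolling Mean and Standard Deviation')
plt.legend()
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()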
- Visualization of Randomly Selected Weekly Data
import random
len_weekly_data = 6*24*7 # one entry every 10 minutes => 6 entries per hour => * 24 h * 7 days
# Function to randomly select 8 different weeks from the entire time range
def select_random_weeks(data):
# Determine the total number of complete weeks
num_weeks = len(data) // len_weekly_data
# Randomly select 8 different week indices
random_week_indices = random.sample(range(num_weeks), 8)
# Sort the selected week indices
random_week_indices.sort()
return random_week_indices
# Randomly select 8 different weeks
random_week_indices = select_random_weeks(data)
# Create subplots arranged in a 4x2 grid
fig, axs = plt.subplots(4, 2, figsize=(10, 12))
fig.suptitle('Visualization of Randomly Selected Weekly Appliances Data')
# Plotting each randomly selected week's data
for i in range(4):
for j in range(2):
# Calculate the start and end index for the randomly selected week
start_index = random_week_indices[i * 2 + j] * len_weekly_data
end_index = start_index + len_weekly_data
# Filter data for the randomly selected week
week_data = data.iloc[start_index:end_index]
# Plotting the relationship between 'date' and 'Appliances' for the randomly selected week
axs[i, j].plot(week_data['date'], week_data['Appliances'], color='#1f77b4', linestyle='-')
axs[i, j].set_title(f'Week {random_week_indices[i * 2 + j] + 1}')
axs[i, j].set_xlabel('Date')
axs[i, j].set_ylabel('Appliances (Wh)')
axs[i, j].tick_params(axis='x', rotation=45)
axs[i, j].grid(True)
# Adjust layout to prevent overlapping of subplots
plt.tight_layout()
plt.show()
Upon analyzing the randomly selected weekly data, a discernible pattern emerged, indicating consistent fluctuations in energy consumption. Despite the random sampling, recurring trends were evident, suggesting underlying patterns in appliance usage throughout the observed time period. This consistency hints at potential factors influencing energy consumption behavior. This aligns with our Hourly Analysis previously.
# Plotting the histogram for the "Appliances" column
plt.figure(figsize=(10, 4)) # Setting the figure size
bin_width = 25 # Define the bin width
max_value = int(max(data['Appliances']))
bins = range(0, max_value + bin_width, bin_width) # Define bins of width 25 Wh starting from 0
plt.hist(data['Appliances'], bins=bins, color='#1f77b4', edgecolor='white') # Plotting histogram with specified bins
plt.title('Distribution of Appliances') # Adding title
plt.xlabel('Appliances (Wh)') # Adding x-axis label
plt.ylabel('Frequency') # Adding y-axis label
plt.grid(True) # Adding gridlines
plt.tight_layout() # Adjusting layout to prevent clipping of labels
plt.show() # Displaying the plot
data.boxplot(column='Appliances', vert=False, figsize=(10, 4))
# Add title and labels
plt.title('Horizontal Box Plot of Appliances')
plt.xlabel('Value')
plt.ylabel('Appliances')
# Show the plot
plt.show()
The box plot has a long right-hand tail, which indicates that the distribution is positively skewed, with some periods of very high energy use.
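The skew visible in the box plot can also be quantified directly; a minimal sketch using pandas' built-in statistics on the same column:
# Quantify the asymmetry seen in the box plot
skewness = data['Appliances'].skew()
p95 = data['Appliances'].quantile(0.95)
print(f"Skewness: {skewness:.2f}, 95th percentile: {p95:.0f} Wh, max: {data['Appliances'].max()} Wh")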
- Correlation analysis
# Delete date time stamps and irrelevant columns
new_data = new_data.drop(['Date', 'Time', 'rv1', 'rv2'], axis=1)
# Auto-check if there is any format that is not 'float64' or 'int64'
for column in new_data.columns:
if new_data[column].dtype != 'float64' and new_data[column].dtype != 'int64':
print(f"Column '{column}' has non-numeric data type: {new_data[column].dtype}")
Column 'hour' has non-numeric data type: int32
Column 'month' has non-numeric data type: int32
Column 'day_of_week' has non-numeric data type: int32
# We can see from above that 'hour', 'month' and 'day_of_week' are stored as 'int32'.
# Cast them to float so all columns share a consistent numeric type before computing correlations.
new_data['hour'] = new_data['hour'].astype(float)
new_data['month'] = new_data['month'].astype(float)
new_data['day_of_week'] = new_data['day_of_week'].astype(float)
# Check type again.
print(new_data.dtypes)
Appliances int64
lights int64
T1 float64
RH_1 float64
T2 float64
RH_2 float64
T3 float64
RH_3 float64
T4 float64
RH_4 float64
T5 float64
RH_5 float64
T6 float64
RH_6 float64
T7 float64
RH_7 float64
T8 float64
RH_8 float64
T9 float64
RH_9 float64
T_out float64
Press_mm_hg float64
RH_out float64
Windspeed float64
Visibility float64
Tdewpoint float64
hour float64
month float64
day_of_week float64
dtype: object
# Calculate correlation coefficients
correlations = new_data.corr()['Appliances'].drop(['Appliances'])
# Plotting the correlations
plt.figure(figsize=(15, 6))
# Define colors based on correlation values
colors = ['red' if corr < 0 else '#1f77b4' for corr in correlations]
# Plot the bar chart with custom colors
bars = correlations.plot(kind='bar', color=colors)
# Add text annotations
for i, corr in enumerate(correlations):
if corr < 0:
plt.text(i, corr, f"{corr:.2f}", ha='center', va='top', fontsize=8)
else:
plt.text(i, corr, f"{corr:.2f}", ha='center', va='bottom', fontsize=8)
plt.title('Correlation with Appliances')
plt.xlabel('Features')
plt.ylabel('Correlation Coefficient')
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
We can observe that the strongest correlation with appliance consumption, at 0.22, is with the time of day (hour), followed by lights at 0.20 and air pressure at -0.15. This suggests that 'hour' and 'lights' are the features most strongly associated with household appliance consumption.
Additionally, the positive coefficients for 'hour' and 'lights' imply that later in the day, or when more lights are on, appliance consumption tends to be higher. Conversely, the negative coefficient for air pressure suggests a slight decrease in consumption as air pressure rises. This highlights the importance of temporal factors and household activity when analysing electricity consumption.
# Calculate pairwise correlations
corr_matrix = new_data.corr()
# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
# Set up the matplotlib figure
plt.figure(figsize=(12, 10))
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr_matrix, mask=mask, cmap='coolwarm', annot=True, fmt=".2f", linewidths=.5)
plt.title('Pairwise Correlation Heatmap')
plt.show()
Here I visually present the correlations among different features using a heatmap. It reveals strong positive correlations among temperatures and weak positive correlations among humidity levels. RH_6 (outdoor humidity) stands out as it exhibits strong negative correlations with temperatures, indicating that as temperatures rise, humidity levels decrease.
In contrast to the correlations among temperatures and among humidities, the correlations between the target variable 'Appliances' and these variables are weak.
- Clustering Analysis
# Selecting features for clustering: those whose absolute correlation with 'Appliances' exceeds 0.06, plus 'Appliances' itself for visualisation
data_cluster = new_data[['Appliances','lights','T2','RH_2','T3','T6','RH_6','RH_7','RH_8','T_out','Press_mm_hg','RH_out','hour']]
data_cluster
Appliances | lights | T2 | RH_2 | T3 | T6 | RH_6 | RH_7 | RH_8 | T_out | Press_mm_hg | RH_out | hour | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 60 | 30 | 19.200000 | 44.790000 | 19.790000 | 7.026667 | 84.256667 | 41.626667 | 48.900000 | 6.600000 | 733.5 | 92.000000 | 17.0 |
1 | 60 | 30 | 19.200000 | 44.722500 | 19.790000 | 6.833333 | 84.063333 | 41.560000 | 48.863333 | 6.483333 | 733.6 | 92.000000 | 17.0 |
2 | 50 | 30 | 19.200000 | 44.626667 | 19.790000 | 6.560000 | 83.156667 | 41.433333 | 48.730000 | 6.366667 | 733.7 | 92.000000 | 17.0 |
3 | 50 | 40 | 19.200000 | 44.590000 | 19.790000 | 6.433333 | 83.423333 | 41.290000 | 48.590000 | 6.250000 | 733.8 | 92.000000 | 17.0 |
4 | 60 | 40 | 19.200000 | 44.530000 | 19.790000 | 6.366667 | 84.893333 | 41.230000 | 48.590000 | 6.133333 | 733.9 | 92.000000 | 17.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
19730 | 100 | 0 | 25.890000 | 42.025714 | 27.200000 | 24.796667 | 1.000000 | 44.500000 | 50.074000 | 22.733333 | 755.2 | 55.666667 | 17.0 |
19731 | 90 | 0 | 25.754000 | 42.080000 | 27.133333 | 24.196667 | 1.000000 | 44.414286 | 49.790000 | 22.600000 | 755.2 | 56.000000 | 17.0 |
19732 | 270 | 10 | 25.628571 | 42.768571 | 27.050000 | 23.626667 | 1.000000 | 44.400000 | 49.660000 | 22.466667 | 755.2 | 56.333333 | 17.0 |
19733 | 420 | 10 | 25.414000 | 43.036000 | 26.890000 | 22.433333 | 1.000000 | 44.295714 | 49.518750 | 22.333333 | 755.2 | 56.666667 | 17.0 |
19734 | 430 | 10 | 25.264286 | 42.971429 | 26.823333 | 21.026667 | 1.000000 | 44.054000 | 49.736000 | 22.200000 | 755.2 | 57.000000 | 18.0 |
19735 rows × 13 columns
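As a cross-check, the hand-picked column list above could also be derived programmatically from the correlation series computed earlier. This is a sketch; `correlations` is the Series from the correlation section, `data_cluster_auto` is an illustrative name, and the resulting list should closely match the hand-picked one.
# Features whose absolute correlation with 'Appliances' exceeds 0.06 (the target itself is added back for plotting)
selected = correlations[correlations.abs() > 0.06].index.tolist()
print(selected)
data_cluster_auto = new_data[['Appliances'] + selected]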
scaler = StandardScaler()
data_cluster_scaled = scaler.fit_transform(data_cluster)
After trying the CURE clustering method, the classification results were found to be poor. Given that our data is fairly evenly and linearly distributed, I finally chose K-means for the cluster analysis.
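The choice of five clusters below is a judgment call; a quick elbow check on the K-means inertia could support it. This is a sketch, assuming data_cluster_scaled from above.
from sklearn.cluster import KMeans
# Elbow check: within-cluster sum of squares (inertia) for a range of k values;
# a visible bend around k = 5 would support the choice used below
inertias = []
for k in range(2, 9):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(data_cluster_scaled)
    inertias.append(km.inertia_)
plt.figure(figsize=(6, 3))
plt.plot(range(2, 9), inertias, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('Inertia')
plt.title('Elbow check for K-means')
plt.show()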
from sklearn.cluster import KMeans
# Create a KMeans analyzer
kmeans = KMeans(n_clusters=5, random_state=42)
# Run the KMeans analysis
kmeans.fit(data_cluster_scaled)
# Get the cluster labels
cluster_labels = kmeans.labels_
# Add the 'cluster' column to the DataFrame
data_cluster['cluster'] = cluster_labels
# Print the size of each cluster
for cluster_label in range(5):
cluster_size = len(cluster_labels[cluster_labels == cluster_label])
print(f"Cluster {cluster_label}: {cluster_size} data points")
Cluster 0: 4822 data points
Cluster 1: 4732 data points
Cluster 2: 4236 data points
Cluster 3: 3328 data points
Cluster 4: 2617 data points
# Create a subplot grid containing all features
fig, axs = plt.subplots(3, 4, figsize=(16, 12))
# Flatten axs into a 1D array for iteration
axs = axs.flatten()
# Get the names of all feature columns except the target variable 'Appliances'
feature_columns = data_cluster.columns.drop('Appliances')
# Plot scatter plot of each feature against the target variable 'Appliances' and color by cluster
for i, feature in enumerate(feature_columns):
# Check if the subplot index is within bounds
if i < len(axs):
sns.scatterplot(x=feature, y='Appliances', hue='cluster', data=data_cluster, ax=axs[i], alpha=0.5, palette='Blues')
axs[i].set_xlabel(feature)
axs[i].set_ylabel('Appliances')
axs[i].legend(title='Cluster')
plt.suptitle('Scatter Plots of Features against Appliances with Cluster Information', fontsize=16)
# Adjust spacing and layout of subplots
plt.tight_layout()
plt.show()
The separation between clusters appears to be quite good and relatively uniform. Particularly in plots related to temperature, the separation is more pronounced, while it's less evident in humidity-related plots. The correlation with lights is also noticeable. This suggests that the dataset has been reasonably well-classified.
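The visual impression of reasonable separation can be quantified with a silhouette score (values closer to 1 indicate better-separated clusters). A sketch on a random subsample, since the full pairwise-distance computation scales as O(n^2):
from sklearn.metrics import silhouette_score
# Silhouette score on a 5,000-point subsample of the scaled features and the fitted cluster labels
rng = np.random.RandomState(42)
sample_idx = rng.choice(len(data_cluster_scaled), size=5000, replace=False)
score = silhouette_score(data_cluster_scaled[sample_idx], cluster_labels[sample_idx])
print(f"Silhouette score (subsample): {score:.3f}")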
2.4 Data Pre-processing for prediction model training
Of the two time series models, ARIMA is generally insensitive to feature scaling, while LSTM models, being a type of recurrent neural network (RNN), are sensitive to it, so the target series is normalised below. Since there are no missing values in the dataset, no imputation is needed, which keeps the preprocessing straightforward.
from sklearn.preprocessing import MinMaxScaler
dataset = data_target.values
dataset = dataset.astype('float32')
# normalize the dataset for LSTM
scaler = MinMaxScaler(feature_range=(0, 1))
dataset = scaler.fit_transform(dataset)
# split into train and test sets
train_size = int(len(dataset) * 0.8)
test_size = len(dataset) - train_size
train, test = dataset[0:train_size,:], dataset[train_size:len(dataset),:]
3. Implement prediction models
Here I initially selected the ARIMA and LSTM models for training and making predictions. They are both renowned for their effectiveness in handling time series data.
Specifically, ARIMA excels when the data exhibit linear trends and seasonality. I presume the data show near-linear trends, primarily because they were collected over a period of steadily rising temperatures, and the analysis in the previous section indicates certain seasonal patterns. These factors suggest the series may follow a fairly consistent trajectory over time, punctuated by recurring seasonal fluctuations, which makes it suitable for an ARIMA model. LSTM (Long Short-Term Memory) networks, on the other hand, shine at capturing complex nonlinear relationships and long-term dependencies in sequential data, which makes them particularly valuable for non-stationary time series or series with intricate patterns over time.
However, in addition to these two specialized models, I also opted for several traditional machine learning algorithms that are not inherently designed for time series analysis, aiming to provide a comprehensive comparison. These models, though not tailored explicitly for time series data, offer different perspectives and approaches to prediction tasks, enriching the analysis and potentially uncovering valuable insights.
3.1 ARIMA model implementing
- Seasonal decomposition
from statsmodels.tsa.seasonal import seasonal_decompose
from pylab import rcParams
from matplotlib import pyplot
# Prepare data to be trained. Here we just use the target feature 'Appliances' together with the time feature 'date'
data_arima = data[['Appliances', 'date']]
data_arima = data_arima.set_index('date')
decomp_period = 6*24*7 # Assuming data is sampled every 10 minutes -- 1 week
# Use the lib to decompose a time series into its constituent components: trend, seasonality, and residuals
result = seasonal_decompose(data_arima, period=decomp_period, model='additive')
rcParams['figure.figsize'] = 12, 12
fig = result.plot()
pyplot.show()
# Explicitly store the result's attribute to new variables.
residual = result.resid
seasonal = result.seasonal
trend = result.trend
# Checking if there are any null values.
residual.isnull().values.any()
True
The 'True' result means the residual series contains NaN values (introduced at the edges by the decomposition window), so we drop them.
# dropping NaN values
residual.dropna(inplace=True)
From the decomposition above, the residual component appears stationary (it fluctuates around a constant mean), so a differencing order of d = 0 is a reasonable choice.
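As a sanity check on that claim, an Augmented Dickey-Fuller test can be run on the residual series; a minimal sketch using statsmodels (null hypothesis: the series is non-stationary):
from statsmodels.tsa.stattools import adfuller
# ADF test on the deseasonalised residuals; a p-value well below 0.05 supports choosing d = 0
adf_stat, p_value, *_ = adfuller(residual.values)
print(f"ADF statistic: {adf_stat:.3f}, p-value: {p_value:.4f}")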
print(residual.head())
date
2016-01-15 05:00:00 -15.840400
2016-01-15 05:10:00 -22.652744
2016-01-15 05:20:00 -14.733727
2016-01-15 05:30:00 -15.757380
2016-01-15 05:40:00 -16.233571
Name: resid, dtype: float64
from statsmodels.tsa.stattools import acf, pacf
# Calculating the autocorrelation function (ACF) of the residuals with a lag of 20
acf_cases = acf(residual, nlags = 20)
# Calculating the partial autocorrelation function (PACF) of the residuals with a lag of 20 using the OLS (Ordinary Least Squares) method
pacf_cases = pacf(residual, nlags = 20, method = 'ols')
rcParams['figure.figsize'] = 12, 4
plt.subplot(121)
plt.plot(acf_cases)
plt.axhline(y=0,linestyle='--',color='gray')
plt.axhline(y=-1.96/np.sqrt(len(residual)),linestyle='--',color='gray')
plt.axhline(y=1.96/np.sqrt(len(residual)),linestyle='--',color='gray')
plt.title('Autocorrelation Function')
plt.subplot(122)
plt.plot(pacf_cases)
plt.axhline(y=0,linestyle='--',color='gray')
plt.axhline(y=-1.96/np.sqrt(len(residual)),linestyle='--',color='gray')
plt.axhline(y=1.96/np.sqrt(len(residual)),linestyle='--',color='gray')
plt.title('Partial Autocorrelation Function')
plt.tight_layout()
- ACF & PACF
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
fig, ax = plt.subplots(figsize=(10, 4)) # Set the figure size
plot_acf(residual, lags=40, ax=ax) # draw the ACF on the axes we created
ax.set_ylim(0, 1.2) # set the range of the Y axis
plt.show()
fig, ax = plt.subplots(figsize=(10, 4)) # Set the figure size
plot_pacf(residual, lags=40, ax=ax) # draw the PACF on the axes we created
ax.set_ylim(-0.2, 1.2) # set the range of the Y axis
plt.show()
In both plots the correlations drop inside the significance bounds after the first few lags, so p = 3 and q = 3 are reasonable starting values for the AR and MA orders.
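Reading p and q off the ACF and PACF is somewhat subjective, so a small AIC comparison over a few candidate orders can back the choice up. This is an optional sketch; each fit on roughly 19k points can be slow.
import itertools
from statsmodels.tsa.arima.model import ARIMA
# Compare a few (p, 0, q) candidates on the residuals by AIC; lower is better
best_order, best_aic = None, float('inf')
for p, q in itertools.product(range(1, 4), range(1, 4)):
    try:
        aic = ARIMA(residual, order=(p, 0, q)).fit().aic
    except Exception:
        continue  # skip orders that fail to converge
    if aic < best_aic:
        best_order, best_aic = (p, 0, q), aic
print('Best order by AIC:', best_order, 'AIC:', round(best_aic, 1))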
3.2 LSTM model implementing
# LSTM for the appliances energy series with time step regression framing (adapted from the airline passengers tutorial)
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM
from sklearn.metrics import mean_squared_error
For LSTM training, I define a function called create_dataset that converts an array of values into a dataset matrix suitable for supervised training. A loop builds the input-output pairs: on each iteration the function extracts a sequence of look_back values from the dataset as the input features, and the value immediately after that sequence as the target.
# Function to convert an array of values into a dataset matrix, default look back set to 1.
def create_dataset(dataset, look_back=1):
dataX, dataY = [], []
for i in range(len(dataset)-look_back-1):
a = dataset[i:(i+look_back), 0]
dataX.append(a)
dataY.append(dataset[i + look_back, 0])
return np.array(dataX), np.array(dataY)
# Fix random seed for reproducibility
tf.random.set_seed(7)
# Reshape into X=t and Y=t+1
look_back = 3 # explicitly set look back value
trainX, trainY = create_dataset(train, look_back)
testX, testY = create_dataset(test, look_back)
# Reshape input to be [samples, time steps, features]
trainX = np.reshape(trainX, (trainX.shape[0], trainX.shape[1], 1))
testX = np.reshape(testX, (testX.shape[0], testX.shape[1], 1))
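To make the sliding-window framing concrete, here is a toy illustration of what create_dataset produces (the values are made up purely for demonstration):
# Toy example: with look_back=2, each row of X holds two consecutive values and y is the next one
toy = np.array([[0.1], [0.2], [0.3], [0.4], [0.5]])
X_toy, y_toy = create_dataset(toy, look_back=2)
print(X_toy)  # [[0.1 0.2]
              #  [0.2 0.3]]
print(y_toy)  # [0.3 0.4]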
3.3 Machine Learning Methods
- Preparing data for machine learning methods
from sklearn.ensemble import GradientBoostingRegressor,RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
from sklearn.model_selection import train_test_split
x = data.drop(['Appliances','date','Windspeed','Visibility','Tdewpoint','rv1','rv2'],axis=1)
y = data['Appliances'] # set target feature
# Split data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
- Import Random Forest regressor and Gradient Boosting regressor
# 2 models
rf = RandomForestRegressor(random_state=42)
gb = GradientBoostingRegressor(random_state=42)
4. Train prediction models
4.1 ARIMA model training
from statsmodels.tsa.arima.model import ARIMA
# Define ARIMA model, with p, d, q as 3, 0, 3
model_AR = ARIMA(residual, order=(3, 0, 3), freq='10T')
# Fit model
results_AR = model_AR.fit() # 10 mins per capture
c:\Users\sulij\anaconda3\envs\mbd\Lib\site-packages\statsmodels\tsa\base\tsa_model.py:473: ValueWarning: No frequency information was provided, so inferred frequency 10min will be used.
self._init_dates(dates, freq)
4.2 LSTM model training
In the experimental stage, I trained with epochs=10 and found that the loss is already close to its minimum once epochs >= 3. Therefore, to keep repeated experiments fast, epochs is set to 3 here.
# create and fit the LSTM network
model = Sequential()
model.add(LSTM(4, input_shape=(look_back, 1)))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
# from previous tests, the loss levels off after about 3 epochs, so choose 3 as our epoch count
model.fit(trainX, trainY, epochs=3, batch_size=1, verbose=2)
Epoch 1/3
15784/15784 - 19s - 1ms/step - loss: 0.0048
Epoch 2/3
15784/15784 - 17s - 1ms/step - loss: 0.0041
Epoch 3/3
15784/15784 - 18s - 1ms/step - loss: 0.0040
<keras.src.callbacks.history.History at 0x224d0497c10>
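As an alternative to hand-picking the epoch count, a Keras EarlyStopping callback could stop training once the validation loss stops improving. The sketch below reuses model, trainX and trainY from above; the larger batch size is an assumption made purely to keep the run fast, not the setting used above.
from tensorflow.keras.callbacks import EarlyStopping
# Stop when the validation loss has not improved for 2 epochs and keep the best weights
early_stop = EarlyStopping(monitor='val_loss', patience=2, restore_best_weights=True)
model.fit(trainX, trainY,
          epochs=20,
          batch_size=32,          # assumption: larger batch than above, for speed only
          validation_split=0.1,
          callbacks=[early_stop],
          verbose=2)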
4.3 Random Forest model training
from sklearn.model_selection import RandomizedSearchCV
# Define the hyperparameter grid
params = {'n_estimators': [100,200,500],'max_depth':[3,5,8]}
# Reduce the number of folds for cross-validation
folds = 3
# Use RandomizedSearchCV instead of GridSearchCV
model_cv_1 = RandomizedSearchCV(estimator=rf,
param_distributions=params,
scoring='neg_mean_absolute_error',
cv=folds,
return_train_score=True,
n_iter=10, # Number of parameter settings sampled
verbose=1,
n_jobs=-1)
# Fit the model
model_cv_1.fit(x_train, y_train)
Fitting 3 folds for each of 9 candidates, totalling 27 fits
4.4 Gradient Boosting Regressor training
# GradientBoostingRegressor
params = {'max_depth': [1,2,3]}
# cross validation
folds = 3
model_cv_2 = GridSearchCV(estimator = gb,
param_grid = params,
scoring= 'neg_mean_absolute_error',
cv = folds,
return_train_score=True,
verbose = 1)
model_cv_2.fit(x_train, y_train)
Fitting 3 folds for each of 3 candidates, totalling 9 fits
5. Test prediction models and show results
5.1 ARIMA model predictions
- Residual predictions
# ARIMA
plt.figure(figsize=(10, 6)) # Set the figure size
# Draw the outcome
plt.plot(residual, label='Actual')
plt.plot(results_AR.fittedvalues, color='red', label='Predicted')
plt.xlabel("Date")
plt.ylabel("Residual")
plt.legend(loc="best")
<matplotlib.legend.Legend at 0x224df779990>
- Final predictions
The autoregressive (AR) model relies on past observations to predict future values, so I have chosen the start time and end time carefully.
# We choose a later period of time to make predictions.
start_time = "2016-01-15 17:00:00"
end_time = "2016-05-27 18:00:00"
predictions = results_AR.predict(start=start_time, end=end_time, dynamic=False, freq='10T')
predictions
2016-01-15 17:00:00 -69.068783
2016-01-15 17:10:00 -44.970132
2016-01-15 17:20:00 -37.958415
2016-01-15 17:30:00 -59.941219
2016-01-15 17:40:00 -47.876292
...
2016-05-27 17:20:00 -0.155686
2016-05-27 17:30:00 -0.155686
2016-05-27 17:40:00 -0.155686
2016-05-27 17:50:00 -0.155686
2016-05-27 18:00:00 -0.155686
Freq: 10min, Name: predicted_mean, Length: 19159, dtype: float64
# Add back the trend and seasonal components; otherwise the predictions only cover the residual component and sit well below the original data
final_predictions = predictions + trend + seasonal
final_predictions.shape
(19735,)
plt.figure(figsize=(10, 6)) # Set the figure size
# Plot original data
plt.plot(data['date'], data_arima['Appliances'], label='Original Data')
# Plot predictions
plt.plot(data['date'], final_predictions, color='red', label='Predictions')
# Set labels and title
plt.xlabel('Date')
plt.ylabel('Appliances (Wh)')
plt.title('ARIMA Model Predictions')
# Add legend
plt.legend()
plt.xticks(rotation=45) # Rotate x-axis labels for better readability
plt.grid(True) # Add gridlines
# Show plot
plt.show()
5.2 LSTM predictions
# make predictions
trainPredict = model.predict(trainX)
testPredict = model.predict(testX)
# invert predictions
trainPredict = scaler.inverse_transform(trainPredict)
trainY = scaler.inverse_transform([trainY])
testPredict = scaler.inverse_transform(testPredict)
testY = scaler.inverse_transform([testY])
# shift train predictions for plotting
trainPredictPlot = np.empty_like(dataset)
trainPredictPlot[:, :] = np.nan
trainPredictPlot[look_back:len(trainPredict)+look_back, :] = trainPredict
# shift test predictions for plotting
testPredictPlot = np.empty_like(dataset)
testPredictPlot[:, :] = np.nan
testPredictPlot[len(trainPredict)+(look_back*2)+1:len(dataset)-1, :] = testPredict
494/494 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step
124/124 ━━━━━━━━━━━━━━━━━━━━ 0s 886us/step
# Plot baseline and predictions
plt.figure(figsize=(10, 6)) # Set the figure size
plt.plot(data['date'], data['Appliances'], color='#1f77b4', linestyle='-', label='Original Data') # Original series
plt.plot(data['date'], trainPredictPlot, color='red', linestyle='-', label='Training Predictions') # Training-set predictions
plt.plot(data['date'], testPredictPlot, color='green', linestyle='-', label='Test Predictions') # Test-set predictions
plt.title('LSTM Model Predictions')
plt.xlabel('Date')
plt.ylabel('Appliances (Wh)')
plt.xticks(rotation=45) # Rotate x-axis labels for better readability
plt.grid(True) # Add gridlines
plt.tight_layout() # Adjust layout to prevent clipping of labels
plt.legend() # Add legend
plt.show()
5.3 Random Forest predictions
#final RandomForestRegressor
rf = RandomForestRegressor(max_depth=8, n_estimators=500,random_state=42)
rf.fit(x_train, y_train)
RandomForestRegressor(max_depth=8, n_estimators=500, random_state=42)
rf_test_pred = rf.predict(x_test)
5.4 Gradient Boosting Regressor predictions
#final GradientBoostingRegressor
gb = GradientBoostingRegressor(max_depth=5,random_state=42)
gb.fit(x_train, y_train)
GradientBoostingRegressor(max_depth=5, random_state=42)
gb_test_pred = gb.predict(x_test)
6. Compare the results from all candidate models, choose the best model, justify your choice and discuss the results
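The blocks below compute MAE, RMSE and MAPE separately for each model. A small helper like the following, a sketch using the same formulas as the per-model blocks, could keep the computations consistent:
def regression_scores(y_true, y_pred):
    """Return (MAE, RMSE, MAPE %) for aligned true/predicted arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mae = np.mean(np.abs(y_true - y_pred))
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100  # assumes no zero targets
    return mae, rmse, mape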
- ARIMA
data_subset = data_arima['Appliances'][(data_arima.index >= start_time) & (data_arima.index <= end_time)]
# Calculate MAE, RMSE, and MAPE for the combined dataset
mae_ar = np.mean(np.abs(data_subset- final_predictions))
mse_ar = np.mean((data_subset- final_predictions) ** 2)
rmse_ar = np.sqrt(mse_ar)
mape_ar = np.mean(np.abs((data_subset- final_predictions) / data_subset)) * 100
print("Mean Absolute Error (MAE) for ARIMA:", mae_ar)
print("Root Mean Squared Error (RMSE) for ARIMA:", rmse_ar)
print("Mean Absolute Percentage Error (MAPE) for ARIMA:", mape_ar)
Mean Absolute Error (MAE) for ARIMA: 30.430244505967522
Root Mean Squared Error (RMSE) for ARIMA: 62.33467043574461
Mean Absolute Percentage Error (MAPE) for ARIMA: 31.594001482855322
- LSTM
from sklearn.metrics import mean_absolute_error, mean_squared_error
# Combine trainPredict and testPredict into a single dataset
combined_predictions = np.concatenate((trainPredict[:,0], testPredict[:,0]))
# Calculate MAE, RMSE, and MAPE for the combined dataset
mae_lstm = mean_absolute_error(np.concatenate((trainY[0], testY[0])), combined_predictions)
rmse_lstm = np.sqrt(mean_squared_error(np.concatenate((trainY[0], testY[0])), combined_predictions))
mape_lstm = np.mean(np.abs((np.concatenate((trainY[0], testY[0])) - combined_predictions) / np.concatenate((trainY[0], testY[0]))) * 100)
print("Mean Absolute Error (MAE) for LSTM:", mae_lstm)
print("Root Mean Squared Error (RMSE) for LSTM:", rmse_lstm)
print("Mean Absolute Percentage Error (MAPE) for LSTM:", mape_lstm)
Mean Absolute Error (MAE) for LSTM: 29.576427576887188
Root Mean Squared Error (RMSE) for LSTM: 65.57357656216654
Mean Absolute Percentage Error (MAPE) for LSTM: 29.764518825012665
- Random Forest
# Calculate MAE, RMSE, and MAPE for the combined dataset
mae_rf = metrics.mean_absolute_error(y_test, rf_test_pred)
rmse_rf=np.sqrt(metrics.mean_squared_error(y_test, rf_test_pred))
mape_rf = np.mean(np.abs((y_test - rf_test_pred) / y_test)) * 100
print("Mean Absolute Error (MAE) for RF:", mae_rf)
print("Root Mean Squared Error (RMSE) for RF:", rmse_rf)
print("Mean Absolute Percentage Error (MAPE) for RF:", mape_rf)
Mean Absolute Error (MAE) for RF: 44.99424619020264
Root Mean Squared Error (RMSE) for RF: 83.49559904823651
Mean Absolute Percentage Error (MAPE) for RF: 52.11667094759201
- Gradient Boosting
# Calculate MAE, RMSE, and MAPE for the combined dataset
mae_gb = metrics.mean_absolute_error(y_test, gb_test_pred)
rmse_gb=np.sqrt(metrics.mean_squared_error(y_test, gb_test_pred))
mape_gb = np.mean(np.abs((y_test - gb_test_pred) / y_test)) * 100
print("Mean Absolute Error (MAE) for GB:", mae_gb)
print("Root Mean Squared Error (RMSE) for GB:", rmse_gb)
print("Mean Absolute Percentage Error (MAPE) for GB:", mape_gb)
Mean Absolute Error (MAE) for GB: 41.618740815093545
Root Mean Squared Error (RMSE) for GB: 78.7710393221513
Mean Absolute Percentage Error (MAPE) for GB: 46.484278532018166
import numpy as np
import matplotlib.pyplot as plt
models = ['ARIMA', 'LSTM', 'RF', 'GB']
mae_scores = [mae_ar, mae_lstm, mae_rf, mae_gb]
rmse_scores = [rmse_ar, rmse_lstm, rmse_rf, rmse_gb]
mape_scores = [mape_ar, mape_lstm, mape_rf, mape_gb]
bar_width = 0.25
index = np.arange(len(models))
plt.bar(index - bar_width, mae_scores, bar_width, label='MAE', color='red')
plt.bar(index, rmse_scores, bar_width, label='RMSE', color=color)
plt.bar(index + bar_width, mape_scores, bar_width, label='MAPE', color='skyblue')
for i in range(len(models)):
plt.text(index[i] - bar_width, mae_scores[i] + 0.01, str(round(mae_scores[i], 2)), ha='center', va='bottom')
plt.text(index[i], rmse_scores[i] + 0.01, str(round(rmse_scores[i], 2)), ha='center', va='bottom')
plt.text(index[i] + bar_width, mape_scores[i] + 0.01, str(round(mape_scores[i], 2)), ha='center', va='bottom')
plt.xlabel('Models')
plt.ylabel('Scores')
plt.title('Comparison of Model Scores')
plt.xticks(index, models)
plt.legend()
plt.show()
From the graph, we can see that the scores of ARIMA and LSTM are very close, placing them in the top tier, followed by Gradient Boosting and then Random Forest.
The reflection from this is that choosing a model and determining its parameters is a complex process. In this assignment, given data collected every 10 minutes over a period of more than four months, both ARIMA and LSTM are viable options for modelling and prediction. Since we can extract prior knowledge of the time series structure (trend, seasonality, residuals), ARIMA is a natural choice, and even a simple parameter-selection process brought it to a relatively good level of performance.
Traditional machine learning methods such as RF and GB can be applied to this kind of problem, but framing the task for them takes more effort than with ARIMA or LSTM, and here their results were noticeably worse.
7. Reflect on what you have learned by completing this assignment
It was my first foray into this field, and the project provided me with valuable insights into the characteristics of time series data and introduced me to relevant analytical methods.
After the EDA, I identified the seasonal character of the dataset and decomposed the target series into trend, seasonal, and residual components using statsmodels' 'tsa' module for the ARIMA implementation. This step is crucial because the ARIMA model here is fitted to, and predicts, the residuals. I then selected suitable values for p, d, and q by inspecting the ACF and PACF. For the LSTM model, I learned the importance of scaling the data; its weights are optimised over training 'epochs', while for Random Forest and Gradient Boosting, scikit-learn searches for good hyperparameters within the specified grids or distributions. Comparing the models showed that although all four achieved broadly similar outcomes, the choice of model can still significantly affect prediction accuracy.
Throughout the entire process, I dedicated over 20 hours to studying relevant concepts online, reviewing provided materials, testing models, and troubleshooting issues. This endeavor deepened my understanding of time series forecasting. However, I am acutely aware that refining models for enhanced efficiency and accuracy is an ongoing journey. I view this experience as a promising starting point for my future studies.
8. References
https://www.kaggle.com/code/gaganmaahi224/appliances-energy-time-series-analysis#ARMA-Models
(for data preprocessing)
https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/
(for LSTM model learning and implementation)
https://machinelearningmastery.com/decompose-time-series-data-trend-seasonality/
(for understanding seasonal decomposition)