Lucas Su

Time Series Forecasting for Appliances Energy Prediction

Time Series Forecasting

Original Data Link: Appliances Energy Prediction Dataset

Target:

  • Use suitable time series analysis techniques.
  • Apply at least two time-series forecasting methods.
  • Compare the results from all candidate models.
  • Choose the best model.
  • Justify your choice and discuss the results.

1. Read the dataset

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix,accuracy_score
import warnings
warnings.filterwarnings('ignore')

# read dataset from csv file
data = pd.read_csv('energydata_complete.csv')
# load only the target column ('Appliances') separately for the univariate forecasting models
data_target = pd.read_csv('energydata_complete.csv', usecols=[1], engine='python')

# set standard color variable
color = "#1f77b4"

# display the first few rows to confirm the data was imported correctly
data.head()
date Appliances lights T1 RH_1 T2 RH_2 T3 RH_3 T4 ... T9 RH_9 T_out Press_mm_hg RH_out Windspeed Visibility Tdewpoint rv1 rv2
0 2016-01-11 17:00:00 60 30 19.89 47.596667 19.2 44.790000 19.79 44.730000 19.000000 ... 17.033333 45.53 6.600000 733.5 92.0 7.000000 63.000000 5.3 13.275433 13.275433
1 2016-01-11 17:10:00 60 30 19.89 46.693333 19.2 44.722500 19.79 44.790000 19.000000 ... 17.066667 45.56 6.483333 733.6 92.0 6.666667 59.166667 5.2 18.606195 18.606195
2 2016-01-11 17:20:00 50 30 19.89 46.300000 19.2 44.626667 19.79 44.933333 18.926667 ... 17.000000 45.50 6.366667 733.7 92.0 6.333333 55.333333 5.1 28.642668 28.642668
3 2016-01-11 17:30:00 50 40 19.89 46.066667 19.2 44.590000 19.79 45.000000 18.890000 ... 17.000000 45.40 6.250000 733.8 92.0 6.000000 51.500000 5.0 45.410389 45.410389
4 2016-01-11 17:40:00 60 40 19.89 46.333333 19.2 44.530000 19.79 45.000000 18.890000 ... 17.000000 45.40 6.133333 733.9 92.0 5.666667 47.666667 4.9 10.084097 10.084097

5 rows × 29 columns

Variables Description:

T1: Temperature in kitchen area, in Celsius

T2: Temperature in living room area, in Celsius

T3: Temperature in laundry room area, in Celsius

T4: Temperature in office room, in Celsius

T5: Temperature in bathroom, in Celsius

T6: Temperature outside the building (north side), in Celsius

T7: Temperature in ironing room, in Celsius

T8: Temperature in teenager room 2, in Celsius

T9: Temperature in parents' room, in Celsius

RH_1: Humidity in kitchen area, in %

RH_2: Humidity in living room area, in %

RH_3: Humidity in laundry room area, in %

RH_4: Humidity in office room, in %

RH_5: Humidity in bathroom, in %

RH_6: Humidity outside the building (north side), in %

RH_7: Humidity in ironing room, in %

RH_8: Humidity in teenager room 2, in %

RH_9: Humidity in parents' room, in %

T_out: Temperature outside (from Chievres weather station), in Celsius

Press_mm_hg: Pressure (from Chievres weather station), in mm Hg

RH_out: Humidity outside (from Chievres weather station), in %

Windspeed: (from Chievres weather station), in m/s

Visibility: (from Chievres weather station), in km

Tdewpoint: (from Chievres weather station), in Celsius

Appliances: energy use in Wh (the dependent variable)

2. Analyse and visualise the data

2.1 Check data's basic information

# Check data scale
data.shape
(19735, 29)
# Check data types
data.dtypes
date            object
Appliances       int64
lights           int64
T1             float64
RH_1           float64
T2             float64
RH_2           float64
T3             float64
RH_3           float64
T4             float64
RH_4           float64
T5             float64
RH_5           float64
T6             float64
RH_6           float64
T7             float64
RH_7           float64
T8             float64
RH_8           float64
T9             float64
RH_9           float64
T_out          float64
Press_mm_hg    float64
RH_out         float64
Windspeed      float64
Visibility     float64
Tdewpoint      float64
rv1            float64
rv2            float64
dtype: object

2.2 Handle missing values

# Check for missing values
data.isnull().sum()
date           0
Appliances     0
lights         0
T1             0
RH_1           0
T2             0
RH_2           0
T3             0
RH_3           0
T4             0
RH_4           0
T5             0
RH_5           0
T6             0
RH_6           0
T7             0
RH_7           0
T8             0
RH_8           0
T9             0
RH_9           0
T_out          0
Press_mm_hg    0
RH_out         0
Windspeed      0
Visibility     0
Tdewpoint      0
rv1            0
rv2            0
dtype: int64

Every column reports zero missing values, so no further handling of missing values is required.

# Check if we have any duplicated data
any(data.duplicated())
False
# Convert the 'date' column to datetime format so that we can use pandas' time series functionality
data['date'] = pd.to_datetime(data['date'])

The 'False' result confirms that there are no duplicate rows in the dataset.

2.3 Exploratory Data Analysis (EDA)

  • Add Features

Since the target variable in the dataset is the electricity consumption of household appliances ('Appliances'), it may be influenced by other features such as temperature. However, intuitively, electricity consumption is directly linked to time. For example, electricity usage in the evening is expected to be higher than during the day because lights are commonly used, and there may be more household activities like watching TV. Conversely, from midnight to early morning, the target variable should be at its lowest as most people are asleep, resulting in minimal electricity usage.

Therefore, to better observe the distribution of the target variable over time, we have introduced the following variables.

# Copy dataset for time-based analysis
new_data = data.copy()
new_data
date Appliances lights T1 RH_1 T2 RH_2 T3 RH_3 T4 ... T9 RH_9 T_out Press_mm_hg RH_out Windspeed Visibility Tdewpoint rv1 rv2
0 2016-01-11 17:00:00 60 30 19.890000 47.596667 19.200000 44.790000 19.790000 44.730000 19.000000 ... 17.033333 45.5300 6.600000 733.5 92.000000 7.000000 63.000000 5.300000 13.275433 13.275433
1 2016-01-11 17:10:00 60 30 19.890000 46.693333 19.200000 44.722500 19.790000 44.790000 19.000000 ... 17.066667 45.5600 6.483333 733.6 92.000000 6.666667 59.166667 5.200000 18.606195 18.606195
2 2016-01-11 17:20:00 50 30 19.890000 46.300000 19.200000 44.626667 19.790000 44.933333 18.926667 ... 17.000000 45.5000 6.366667 733.7 92.000000 6.333333 55.333333 5.100000 28.642668 28.642668
3 2016-01-11 17:30:00 50 40 19.890000 46.066667 19.200000 44.590000 19.790000 45.000000 18.890000 ... 17.000000 45.4000 6.250000 733.8 92.000000 6.000000 51.500000 5.000000 45.410389 45.410389
4 2016-01-11 17:40:00 60 40 19.890000 46.333333 19.200000 44.530000 19.790000 45.000000 18.890000 ... 17.000000 45.4000 6.133333 733.9 92.000000 5.666667 47.666667 4.900000 10.084097 10.084097
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
19730 2016-05-27 17:20:00 100 0 25.566667 46.560000 25.890000 42.025714 27.200000 41.163333 24.700000 ... 23.200000 46.7900 22.733333 755.2 55.666667 3.333333 23.666667 13.333333 43.096812 43.096812
19731 2016-05-27 17:30:00 90 0 25.500000 46.500000 25.754000 42.080000 27.133333 41.223333 24.700000 ... 23.200000 46.7900 22.600000 755.2 56.000000 3.500000 24.500000 13.300000 49.282940 49.282940
19732 2016-05-27 17:40:00 270 10 25.500000 46.596667 25.628571 42.768571 27.050000 41.690000 24.700000 ... 23.200000 46.7900 22.466667 755.2 56.333333 3.666667 25.333333 13.266667 29.199117 29.199117
19733 2016-05-27 17:50:00 420 10 25.500000 46.990000 25.414000 43.036000 26.890000 41.290000 24.700000 ... 23.200000 46.8175 22.333333 755.2 56.666667 3.833333 26.166667 13.233333 6.322784 6.322784
19734 2016-05-27 18:00:00 430 10 25.500000 46.600000 25.264286 42.971429 26.823333 41.156667 24.700000 ... 23.200000 46.8450 22.200000 755.2 57.000000 4.000000 27.000000 13.200000 34.118851 34.118851

19735 rows × 29 columns

# Converting date into datetime
new_data['date'] = new_data['date'].astype('datetime64[ns]')
new_data['Date'] = pd.to_datetime(new_data['date']).dt.date
new_data['Time'] = pd.to_datetime(new_data['date']).dt.time
new_data['hour'] = new_data['date'].dt.hour
new_data['month'] = new_data['date'].dt.month
new_data['day_of_week'] = new_data['date'].dt.dayofweek

new_data= new_data.drop(["date"], axis=1)
new_data
Appliances lights T1 RH_1 T2 RH_2 T3 RH_3 T4 RH_4 ... Windspeed Visibility Tdewpoint rv1 rv2 Date Time hour month day_of_week
0 60 30 19.890000 47.596667 19.200000 44.790000 19.790000 44.730000 19.000000 45.566667 ... 7.000000 63.000000 5.300000 13.275433 13.275433 2016-01-11 17:00:00 17 1 0
1 60 30 19.890000 46.693333 19.200000 44.722500 19.790000 44.790000 19.000000 45.992500 ... 6.666667 59.166667 5.200000 18.606195 18.606195 2016-01-11 17:10:00 17 1 0
2 50 30 19.890000 46.300000 19.200000 44.626667 19.790000 44.933333 18.926667 45.890000 ... 6.333333 55.333333 5.100000 28.642668 28.642668 2016-01-11 17:20:00 17 1 0
3 50 40 19.890000 46.066667 19.200000 44.590000 19.790000 45.000000 18.890000 45.723333 ... 6.000000 51.500000 5.000000 45.410389 45.410389 2016-01-11 17:30:00 17 1 0
4 60 40 19.890000 46.333333 19.200000 44.530000 19.790000 45.000000 18.890000 45.530000 ... 5.666667 47.666667 4.900000 10.084097 10.084097 2016-01-11 17:40:00 17 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
19730 100 0 25.566667 46.560000 25.890000 42.025714 27.200000 41.163333 24.700000 45.590000 ... 3.333333 23.666667 13.333333 43.096812 43.096812 2016-05-27 17:20:00 17 5 4
19731 90 0 25.500000 46.500000 25.754000 42.080000 27.133333 41.223333 24.700000 45.590000 ... 3.500000 24.500000 13.300000 49.282940 49.282940 2016-05-27 17:30:00 17 5 4
19732 270 10 25.500000 46.596667 25.628571 42.768571 27.050000 41.690000 24.700000 45.730000 ... 3.666667 25.333333 13.266667 29.199117 29.199117 2016-05-27 17:40:00 17 5 4
19733 420 10 25.500000 46.990000 25.414000 43.036000 26.890000 41.290000 24.700000 45.790000 ... 3.833333 26.166667 13.233333 6.322784 6.322784 2016-05-27 17:50:00 17 5 4
19734 430 10 25.500000 46.600000 25.264286 42.971429 26.823333 41.156667 24.700000 45.963333 ... 4.000000 27.000000 13.200000 34.118851 34.118851 2016-05-27 18:00:00 18 5 4

19735 rows × 33 columns

  • Hourly distribution of target variable
# Calculate the total energy consumed by the appliance per hour
app_hour = new_data.groupby(by='hour',as_index=False)['Appliances'].sum()
# Sort app_hour by descending order
app_hour.sort_values(by='Appliances',ascending=False)
hour Appliances
18 18 156670
17 17 133600
19 19 117600
11 11 109430
20 20 104380
10 10 103060
13 13 102540
12 12 101630
16 16 98560
9 9 92710
14 14 89010
8 8 87250
15 15 86990
21 21 79320
7 7 64650
22 22 56840
6 6 47440
23 23 46840
0 0 43390
5 5 43350
1 1 42190
4 4 40570
2 2 40340
3 3 39650
# Function to format the numerical value to 'k' format
def format_value(value):
    if value < 1000:
        return f"{value:.0f}"
    elif value < 10000:
        return f"{value/1000:.1f}k"
    else:
        return f"{value/1000:.0f}k"

# Plotting the data
plt.figure(figsize=(10, 4))  # Set the figure size
bars = plt.bar(app_hour['hour'], app_hour['Appliances'], color=color)  # Create a bar plot

# Add numerical values on top of each bar
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width() / 2, height, format_value(height), ha='center', va='bottom')

plt.xlabel('Hour')  # Set the x-axis label
plt.ylabel('Energy Consumed by Appliances')  # Set the y-axis label
plt.title('Energy Consumption by Hour')  # Set the title
plt.xticks(app_hour['hour'])  # Set the x-axis ticks to match the hours
plt.grid(axis='y', linestyle='--', alpha=0.7)  # Add gridlines to the y-axis
plt.show()  # Show the plot

[Figure: Energy Consumption by Hour]

The graph above confirms our hypothesis: electricity consumption is highest between 17:00 and 20:00 and lowest between 23:00 and 06:00 the following day. Because this matches expectations, it also serves as a basic sanity check on the reliability of the data.

  • Weekday distribution of target variable
# Calculate the total energy consumed by appliances per day of the week
app_week_day = new_data.groupby(by='day_of_week',as_index=False)['Appliances'].sum()
# Sort app_week_day in descending order
app_week_day.sort_values(by='Appliances',ascending=False)
day_of_week Appliances
0 0 309610
4 4 297650
5 5 290690
3 3 260450
6 6 259690
2 2 259000
1 1 250920
# pandas dt.dayofweek encodes Monday as 0 and Sunday as 6
day_names = {0: 'Monday', 1: 'Tuesday', 2: 'Wednesday', 3: 'Thursday', 4: 'Friday', 5: 'Saturday', 6: 'Sunday'}

# Plotting the data
plt.figure(figsize=(10, 4))  # Set the figure size
bars = plt.bar(app_week_day['day_of_week'], app_week_day['Appliances'], color=color)  # Create a bar plot

# Add numerical values on top of each bar
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width() / 2, height, format_value(height), ha='center', va='bottom')

plt.xlabel('Day of the week')  # Set the x-axis label
plt.ylabel('Energy Consumed by Appliances')  # Set the y-axis label
plt.title('Energy Consumption by Day of the Week')  # Set the title
plt.xticks(list(day_names.keys()), list(day_names.values())) 
plt.grid(axis='y', linestyle='--', alpha=0.7)  # Add gridlines to the y-axis
plt.show()  # Show the plot

[Figure: Energy Consumption by Day of the Week]

As with the hourly analysis, there could also be a pattern across the span of a week. However, when split by day of the week, the distribution of the target variable shows no strong features. The only noteworthy point is that consumption on day 0 (Monday, under pandas' day-of-week encoding) is roughly 10-15% higher than on the other days.

  • General view of the whole dataset
# Filter columns excluding 'rv1' and 'rv2'
selected_columns = [col for col in data.columns[2:] if col not in ['rv1', 'rv2']]

# Determine the number of rows needed for the subplots
num_cols = len(selected_columns)
num_rows = (num_cols - 1) // 5 + 1

# Create subplots
fig, axs = plt.subplots(num_rows, 5, figsize=(14, num_rows*4))

# Loop through selected columns and plot time series on subplots
for i in range(num_rows):
    for j in range(5):
        index = i * 5 + j
        if index < num_cols:
            var = selected_columns[index]
            sns.lineplot(y=data[var], x=data['date'], ax=axs[i,j], linewidth=1.5)
            axs[i,j].set_xlabel('Date')
            axs[i,j].set_ylabel(var)
            axs[i,j].set_title('{} Time Series Data'.format(var))
            axs[i,j].tick_params(axis='x', rotation=90)
        else:
            axs[i,j].axis('off')

plt.tight_layout()
plt.show()

[Figure: Time series plot of each selected feature]

# Extracting numerical columns from the data
numerical = ['Appliances', 'lights', 'T1', 'RH_1', 'T2', 'RH_2', 'T3', 'RH_3', 
             'T4', 'RH_4', 'T5', 'RH_5', 'T6', 'RH_6', 'T7', 'RH_7', 'T8', 
             'RH_8', 'T9', 'RH_9', 'T_out', 'Press_mm_hg', 'RH_out', 
             'Windspeed', 'Visibility', 'Tdewpoint', 'rv1', 'rv2']

# Create subplots
fig, axs = plt.subplots(4, 7, figsize=(14, 10))

# Flatten the array of axes for easy iteration
axs = axs.flatten()

# Loop through each numerical column and plot its distribution
for i, col in enumerate(numerical):
    ax = axs[i]
    sns.boxplot(x=data['date'].dt.month, y=data[col], ax=ax)
    ax.set_title(col)
    ax.set_xlabel('Month')
    ax.set_ylabel('Value')

# Adjust layout and show the plot
plt.tight_layout()
plt.show()
[Figure: Monthly box plots of each numerical feature]

For temperature (T), there is an overall increasing trend in the data. This is in line with common sense, as the seasons transition from winter to spring in the Northern Hemisphere. Other data show no significant features.

# Creating subplots
fig, axs = plt.subplots(4, 7, figsize=(14, 10))

# Counter for accessing columns
cpt = 0

# Looping through subplots and plotting histograms for each numerical column
for i in range(4):
    for j in range(7):
        var = numerical[cpt]
        axs[i,j].hist(data[var].values, rwidth=0.9)
        axs[i,j].set_title(var)
        cpt += 1

fig.suptitle('Independent Variable Distribution')

# Adjusting layout and displaying the plot
plt.tight_layout()
plt.show()

[Figure: Independent Variable Distribution (histograms)]

We can observe the shape of the distribution for each feature. For example, features like temperature and humidity may show a normal distribution, while others might be skewed or have multiple peaks.

We can also see that rv1 and rv2 are just random values that were added to the dataset; they can be ignored when modelling 'Appliances'.
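As a quick, hedged sanity check (the variable name data_no_rv below is purely illustrative and not part of the original notebook), we can confirm that rv1 and rv2 duplicate each other and add essentially no signal about the target:

# Sketch: verify that rv1 and rv2 are identical and uncorrelated with the target
print((data['rv1'] == data['rv2']).all())                  # expected: True (identical columns)
print(data[['rv1', 'rv2']].corrwith(data['Appliances']))   # expected: correlations near zero
data_no_rv = data.drop(columns=['rv1', 'rv2'])             # illustrative copy without the noise columns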

  • A closer look at Appliances

'Appliances' is our key object of study, so it deserves a closer look.

# Plot the relationship between 'date' and 'Appliances'
plt.figure(figsize=(10, 6))  # Set the figure size
plt.plot(data['date'], data['Appliances'], color='#1f77b4',  linestyle='-')  # Plot with markers
plt.title('Relationship between Date and Appliances')
plt.xlabel('Date')
plt.ylabel('Appliances (Wh)')
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability
plt.grid(True)  # Add gridlines
plt.tight_layout()  # Adjust layout to prevent clipping of labels
plt.show()
[Figure: Relationship between Date and Appliances]

From the plot above we can see no visually strong pattern between electricity consumption ('Appliances') and date. However, the absence of an obvious pattern does not necessarily mean the data lack underlying structure; on the contrary, it underscores the importance of uncovering the latent patterns hidden within the data.

# Calculating the simple moving averages (SMA) of the 'Appliances' column in the data DataFrame using different window sizes
window_size_1 = 100
window_size_2 = 200
window_size_3 = 500
sma_1 = data['Appliances'].rolling(window=window_size_1).mean() # mean over the past 100 data entries
sma_2 = data['Appliances'].rolling(window=window_size_2).mean() # mean over the past 200 data entries
sma_3 = data['Appliances'].rolling(window=window_size_3).mean() # mean over the past 500 data entries

# Plot the original series together with its moving averages
plt.figure(figsize=(10, 6))  
plt.plot(data['date'], data['Appliances'], label='Original Data')
plt.plot(data['date'], sma_1, label='SMA (window size = {})'.format(window_size_1))
plt.plot(data['date'], sma_2, label='SMA (window size = {})'.format(window_size_2))
plt.plot(data['date'], sma_3, label='SMA (window size = {})'.format(window_size_3))
plt.xlabel('Date')
plt.ylabel('Appliances')
plt.title('Simple Moving Average')
plt.legend()

# Rotate x-axis labels by 45 degrees
plt.xticks(rotation=45)

plt.show()

[Figure: Simple Moving Average of Appliances]

After applying simple moving averages (SMA) with window sizes of 100, 200, and 500 and inspecting the resulting curves, no strong trend is discernible at first glance. This suggests that the 'Appliances' series stays relatively stable (roughly stationary) over the data capture period.

  • Visualization of Randomly Selected Weekly Data
import random

len_weekly_data = 6*24*7 # one entry every 10 minutes => 6 entries per hour => * 24 h * 7 days

# Function to randomly select 8 different weeks from the entire time range
def select_random_weeks(data):
    # Determine the total number of complete weeks
    num_weeks = len(data) // len_weekly_data
    # Randomly select 8 different week indices
    random_week_indices = random.sample(range(num_weeks), 8)
    # Sort the selected week indices
    random_week_indices.sort()
    return random_week_indices

# Randomly select 8 different weeks
random_week_indices = select_random_weeks(data)

# Create subplots arranged in a 4x2 grid
fig, axs = plt.subplots(4, 2, figsize=(10, 12))
fig.suptitle('Visualization of Randomly Selected Weekly Appliances Data')

# Plotting each randomly selected week's data
for i in range(4):
    for j in range(2):
        # Calculate the start and end index for the randomly selected week
        start_index = random_week_indices[i * 2 + j] * len_weekly_data
        end_index = start_index + len_weekly_data

        # Filter data for the randomly selected week
        week_data = data.iloc[start_index:end_index]

        # Plotting the relationship between 'date' and 'Appliances' for the randomly selected week
        axs[i, j].plot(week_data['date'], week_data['Appliances'], color='#1f77b4', linestyle='-')
        axs[i, j].set_title(f'Week {random_week_indices[i * 2 + j] + 1}')
        axs[i, j].set_xlabel('Date')
        axs[i, j].set_ylabel('Appliances (Wh)')
        axs[i, j].tick_params(axis='x', rotation=45)
        axs[i, j].grid(True)

# Adjust layout to prevent overlapping of subplots
plt.tight_layout()
plt.show()

[Figure: Randomly selected weekly Appliances data]

Analyzing the randomly selected weeks reveals a consistent pattern of fluctuations in energy consumption. Despite the random sampling, recurring daily trends are evident, suggesting underlying regularities in appliance usage across the observed period. This is consistent with the hourly analysis above.
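To put a number on this recurring daily rhythm, one small check of my own (not part of the original notebook) is the autocorrelation of 'Appliances' at a one-day lag:

# Sketch: with one sample every 10 minutes, a lag of 144 steps is exactly one day
steps_per_day = 6 * 24
daily_autocorr = data['Appliances'].autocorr(lag=steps_per_day)
print(f"Autocorrelation at a 1-day lag: {daily_autocorr:.3f}")  # a positive value supports daily periodicity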

# Plotting the histogram for the "Appliances" column
plt.figure(figsize=(10, 4))  # Setting the figure size
bin_width = 25  # Define the bin width
max_value = int(max(data['Appliances']))
bins = range(0, max_value + bin_width, bin_width)  # Define bins of width 25 Wh starting from 0
plt.hist(data['Appliances'], bins=bins, color='#1f77b4', edgecolor='white')  # Plotting histogram with specified bins
plt.title('Distribution of Appliances')  # Adding title
plt.xlabel('Appliances (Wh)')  # Adding x-axis label
plt.ylabel('Frequency')  # Adding y-axis label
plt.grid(True)  # Adding gridlines
plt.tight_layout()  # Adjusting layout to prevent clipping of labels
plt.show()  # Displaying the plot

[Figure: Distribution of Appliances (histogram)]

data.boxplot(column='Appliances', vert=False, figsize=(10, 4))

# Add title and labels
plt.title('Horizontal Box Plot of Appliances')
plt.xlabel('Value')
plt.ylabel('Appliances')

# Show the plot
plt.show()
[Figure: Horizontal box plot of Appliances]

The box plot has a long right-hand tail, indicating that the distribution is positively skewed, with occasional periods of very high energy use.
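To quantify that right-hand tail, here is a minimal sketch of my own using pandas' built-in skewness estimate (not part of the original analysis):

# Sketch: a positive skewness value confirms the right-skewed shape seen in the box plot
print(f"Skewness of 'Appliances': {data['Appliances'].skew():.2f}")
print(data['Appliances'].describe())  # the mean sitting well above the median is another sign of right skew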

  • Correlation analysis
# Drop the date/time stamps and irrelevant columns
new_data = new_data.drop(['Date', 'Time', 'rv1', 'rv2'], axis=1)

# Auto-check if there is any format that is not 'float64' or 'int64'
for column in new_data.columns:
    if new_data[column].dtype != 'float64' and new_data[column].dtype != 'int64':
        print(f"Column '{column}' has non-numeric data type: {new_data[column].dtype}")

Column 'hour' has non-numeric data type: int32
Column 'month' has non-numeric data type: int32
Column 'day_of_week' has non-numeric data type: int32
# 'hour', 'month' and 'day_of_week' are stored as int32; cast them to float so that
# every column shares a consistent numeric dtype before the correlation analysis.

new_data['hour'] = new_data['hour'].astype(float)
new_data['month'] = new_data['month'].astype(float)
new_data['day_of_week'] = new_data['day_of_week'].astype(float)

# Check type again.
print(new_data.dtypes)

Appliances       int64
lights           int64
T1             float64
RH_1           float64
T2             float64
RH_2           float64
T3             float64
RH_3           float64
T4             float64
RH_4           float64
T5             float64
RH_5           float64
T6             float64
RH_6           float64
T7             float64
RH_7           float64
T8             float64
RH_8           float64
T9             float64
RH_9           float64
T_out          float64
Press_mm_hg    float64
RH_out         float64
Windspeed      float64
Visibility     float64
Tdewpoint      float64
hour           float64
month          float64
day_of_week    float64
dtype: object
# Calculate correlation coefficients
correlations = new_data.corr()['Appliances'].drop(['Appliances'])

# Plotting the correlations
plt.figure(figsize=(15, 6))

# Define colors based on correlation values
colors = ['red' if corr < 0 else '#1f77b4' for corr in correlations]

# Plot the bar chart with custom colors
bars = correlations.plot(kind='bar', color=colors)

# Add text annotations
for i, corr in enumerate(correlations):
    if corr < 0:
        plt.text(i, corr, f"{corr:.2f}", ha='center', va='top', fontsize=8)
    else:
        plt.text(i, corr, f"{corr:.2f}", ha='center', va='bottom', fontsize=8)


plt.title('Correlation with Appliances')
plt.xlabel('Features')
plt.ylabel('Correlation Coefficient')
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

[Figure: Correlation with Appliances (bar chart)]

We can observe that the strongest correlation with appliance consumption, at 0.22, is with the time of day ('hour'), followed by 'lights' at 0.20 and air pressure at -0.15. Among the available features, 'hour' and 'lights' are therefore the ones most strongly associated with household appliance electricity consumption.

Additionally, the positive correlation coefficients for 'hour' and 'lights' imply that as the time of day increases or as more lights are turned on, there is a tendency for higher appliances consumption. Conversely, the negative correlation coefficient for air pressure suggests a slight decrease in appliances consumption as air pressure rises. This highlights the significance of considering temporal factors and household activities in understanding and managing electricity consumption.
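The SelectKBest and f_regression utilities imported at the top of the notebook offer another way to rank features against 'Appliances'. The sketch below is my own illustration (assuming new_data is the all-numeric frame built above); the univariate F-scores should broadly echo the correlation ranking.

# Sketch: rank features by univariate F-score against the target
X_feat = new_data.drop(columns=['Appliances'])
y_feat = new_data['Appliances']

selector = SelectKBest(score_func=f_regression, k=10)
selector.fit(X_feat, y_feat)

scores = pd.Series(selector.scores_, index=X_feat.columns).sort_values(ascending=False)
print(scores.head(10))  # 'hour' and 'lights' are expected near the top, matching the correlation plot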

# Calculate pairwise correlations
corr_matrix = new_data.corr()

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))

# Set up the matplotlib figure
plt.figure(figsize=(12, 10))

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr_matrix, mask=mask, cmap='coolwarm', annot=True, fmt=".2f", linewidths=.5)

plt.title('Pairwise Correlation Heatmap')
plt.show()
[Figure: Pairwise Correlation Heatmap]

Here I visually present the correlations among different features using a heatmap. It reveals strong positive correlations among temperatures and weak positive correlations among humidity levels. RH_6 (outdoor humidity) stands out as it exhibits strong negative correlations with temperatures, indicating that as temperatures rise, humidity levels decrease.

In contrast to the correlations among temperatures and humidities, the correlation between our target variable 'Appliances' and these variables appears insignificant.

  • Clustering Analysis
# Select the features whose absolute correlation with 'Appliances' exceeds 0.06 (the target itself is kept so it can be plotted against them below)
data_cluster = new_data[['Appliances','lights','T2','RH_2','T3','T6','RH_6','RH_7','RH_8','T_out','Press_mm_hg','RH_out','hour']]
data_cluster
Appliances lights T2 RH_2 T3 T6 RH_6 RH_7 RH_8 T_out Press_mm_hg RH_out hour
0 60 30 19.200000 44.790000 19.790000 7.026667 84.256667 41.626667 48.900000 6.600000 733.5 92.000000 17.0
1 60 30 19.200000 44.722500 19.790000 6.833333 84.063333 41.560000 48.863333 6.483333 733.6 92.000000 17.0
2 50 30 19.200000 44.626667 19.790000 6.560000 83.156667 41.433333 48.730000 6.366667 733.7 92.000000 17.0
3 50 40 19.200000 44.590000 19.790000 6.433333 83.423333 41.290000 48.590000 6.250000 733.8 92.000000 17.0
4 60 40 19.200000 44.530000 19.790000 6.366667 84.893333 41.230000 48.590000 6.133333 733.9 92.000000 17.0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
19730 100 0 25.890000 42.025714 27.200000 24.796667 1.000000 44.500000 50.074000 22.733333 755.2 55.666667 17.0
19731 90 0 25.754000 42.080000 27.133333 24.196667 1.000000 44.414286 49.790000 22.600000 755.2 56.000000 17.0
19732 270 10 25.628571 42.768571 27.050000 23.626667 1.000000 44.400000 49.660000 22.466667 755.2 56.333333 17.0
19733 420 10 25.414000 43.036000 26.890000 22.433333 1.000000 44.295714 49.518750 22.333333 755.2 56.666667 17.0
19734 430 10 25.264286 42.971429 26.823333 21.026667 1.000000 44.054000 49.736000 22.200000 755.2 57.000000 18.0

19735 rows × 13 columns

scaler = StandardScaler()
data_cluster_scaled = scaler.fit_transform(data_cluster)

I first tried the CURE clustering method, but its results were poor. Given that the data are fairly linear and evenly distributed, I chose K-means for the cluster analysis instead.
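Before fixing n_clusters=5, a quick elbow check over the K-means inertia can sanity-test that choice. This is a hedged sketch of my own (it assumes data_cluster_scaled from the cell above), not something run in the original notebook:

from sklearn.cluster import KMeans

# Sketch: within-cluster sum of squares (inertia) for k = 2..9; look for the "elbow"
inertias = []
k_values = range(2, 10)
for k in k_values:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(data_cluster_scaled)
    inertias.append(km.inertia_)

plt.figure(figsize=(8, 4))
plt.plot(list(k_values), inertias, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('Inertia')
plt.title('Elbow Method for K-means')
plt.grid(True)
plt.show()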

from sklearn.cluster import KMeans

# Create a KMeans analyzer
kmeans = KMeans(n_clusters=5, random_state=42)

# Run the KMeans analysis
kmeans.fit(data_cluster_scaled)

# Get the cluster labels
cluster_labels = kmeans.labels_

# Add the 'cluster' column to the DataFrame
data_cluster['cluster'] = cluster_labels

# Print the size of each cluster
for cluster_label in range(5):
    cluster_size = len(cluster_labels[cluster_labels == cluster_label])
    print(f"Cluster {cluster_label}: {cluster_size} data points")
Cluster 0: 4822 data points
Cluster 1: 4732 data points
Cluster 2: 4236 data points
Cluster 3: 3328 data points
Cluster 4: 2617 data points
# Create a subplot grid containing all features
fig, axs = plt.subplots(3, 4, figsize=(16, 12))

# Flatten axs into a 1D array for iteration
axs = axs.flatten()

# Get the names of all feature columns except the target variable 'Appliances'
feature_columns = data_cluster.columns.drop('Appliances')

# Plot scatter plot of each feature against the target variable 'Appliances' and color by cluster
for i, feature in enumerate(feature_columns):
    # Check if the subplot index is within bounds
    if i < len(axs):
        sns.scatterplot(x=feature, y='Appliances', hue='cluster', data=data_cluster, ax=axs[i], alpha=0.5, palette='Blues')
        axs[i].set_xlabel(feature)
        axs[i].set_ylabel('Appliances')
        axs[i].legend(title='Cluster')

plt.suptitle('Scatter Plots of Features against Appliances with Cluster Information', fontsize=16)

# Adjust spacing and layout of subplots
plt.tight_layout()
plt.show()

[Figure: Scatter plots of features against Appliances, coloured by cluster]

The separation between clusters appears quite good and relatively uniform. It is more pronounced in the temperature-related plots and less evident in the humidity-related ones, and the relationship with lights is also noticeable. This suggests the dataset has been partitioned into reasonably coherent clusters.

2.5 Data Pre-processing for prediction model training

ARIMA is generally insensitive to feature scaling, but LSTM models, as a type of recurrent neural network (RNN), are sensitive to it, so the target series is normalised before training the LSTM. Since there are no missing values in the dataset, no imputation is required, which keeps the preprocessing straightforward.

from sklearn.preprocessing import MinMaxScaler

dataset = data_target.values
dataset = dataset.astype('float32')
# normalize the dataset for LSTM
scaler = MinMaxScaler(feature_range=(0, 1))
dataset = scaler.fit_transform(dataset)
# split into train and test sets
train_size = int(len(dataset) * 0.8)
test_size = len(dataset) - train_size
train, test = dataset[0:train_size,:], dataset[train_size:len(dataset),:]

3. Implement prediction models

Here I initially selected the ARIMA and LSTM models for training and making predictions. They are both renowned for their effectiveness in handling time series data.

Specifically, ARIMA excels when the data exhibit linear trends and seasonality. I assume the data follow a near-linear trend, primarily because they were collected during a period of steadily rising temperatures, and the analysis in the previous section indicates the presence of seasonal patterns. These factors suggest a fairly consistent trajectory over time punctuated by recurring seasonal fluctuations, which makes the series a good fit for ARIMA. LSTM (Long Short-Term Memory) networks, on the other hand, shine at capturing complex nonlinear relationships and long-term dependencies in sequential data, which is particularly valuable for non-stationary series or those with intricate temporal patterns.

However, in addition to these two specialized models, I also opted for several traditional machine learning algorithms that are not inherently designed for time series analysis, aiming to provide a comprehensive comparison. These models, though not tailored explicitly for time series data, offer different perspectives and approaches to prediction tasks, enriching the analysis and potentially uncovering valuable insights.

3.1 ARIMA model implementing

  • Seasonal decomposition
from statsmodels.tsa.seasonal import seasonal_decompose
from pylab import rcParams
from matplotlib import pyplot

# Prepare data to be trained. Here we just use the target feature 'Appliances' together with the time feature 'date'
data_arima = data[['Appliances', 'date']]
data_arima = data_arima.set_index('date')

decomp_period  = 6*24*7 # Assuming data is sampled every 10 minutes -- 1 week

# Use the lib to decompose a time series into its constituent components: trend, seasonality, and residuals
result = seasonal_decompose(data_arima, period=decomp_period, model='additive')
rcParams['figure.figsize'] = 12, 12
fig = result.plot()
pyplot.show()

# Explicitly store the result's attribute to new variables.
residual = result.resid
seasonal = result.seasonal
trend = result.trend
[Figure: Seasonal decomposition of Appliances into trend, seasonal, and residual components]

# Checking if there are any null values. 
residual.isnull().values.any()
True

The 'True' result means the residual series contains NaN values (introduced at the edges by the decomposition's moving average), so they have to be dropped.

# dropping NaN values
residual.dropna(inplace=True)

From the decomposition above, the residual series can be considered stationary, so a differencing order of d = 0 is a reasonable choice.
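To back up this visual judgement, an Augmented Dickey-Fuller test on the residual series is a common formal check. The snippet below is a hedged sketch of mine (not part of the original notebook); a p-value well below 0.05 would support keeping d = 0.

from statsmodels.tsa.stattools import adfuller

# Sketch: ADF test, where the null hypothesis is that the series has a unit root (is non-stationary)
adf_stat, p_value, used_lags, n_obs, critical_values, _ = adfuller(residual)
print(f"ADF statistic: {adf_stat:.3f}, p-value: {p_value:.4f}")
print("Critical values:", critical_values)
# Rejecting the null (small p-value) means the residuals look stationary, so no differencing is needed.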

print(residual.head())
date
2016-01-15 05:00:00   -15.840400
2016-01-15 05:10:00   -22.652744
2016-01-15 05:20:00   -14.733727
2016-01-15 05:30:00   -15.757380
2016-01-15 05:40:00   -16.233571
Name: resid, dtype: float64
from statsmodels.tsa.stattools import acf, pacf

# Calculating the autocorrelation function (ACF) of the residuals with a lag of 20
acf_cases = acf(residual, nlags = 20)

# Calculating the partial autocorrelation function (PACF) of the residuals with a lag of 20 using the OLS (Ordinary Least Squares) method
pacf_cases = pacf(residual, nlags = 20, method = 'ols')

rcParams['figure.figsize'] = 12, 4
plt.subplot(121)
plt.plot(acf_cases)
plt.axhline(y=0,linestyle='--',color='gray')
plt.axhline(y=-1.96/np.sqrt(len(residual)),linestyle='--',color='gray')
plt.axhline(y=1.96/np.sqrt(len(residual)),linestyle='--',color='gray')
plt.title('Autocorrelation Function')

plt.subplot(122)
plt.plot(pacf_cases)
plt.axhline(y=0,linestyle='--',color='gray')
plt.axhline(y=-1.96/np.sqrt(len(residual)),linestyle='--',color='gray')
plt.axhline(y=1.96/np.sqrt(len(residual)),linestyle='--',color='gray')
plt.title('Partial Autocorrelation Function')
plt.tight_layout()
[Figure: ACF and PACF of the residual series]

  • ACF & PACF
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
plt.figure(figsize=(10, 4))  # Set the figure size

plot_acf(residual, lags = 40)
plt.ylim(0, 1.2)  # set the range of Y axis 
[Figure: Autocorrelation function of the residuals (40 lags)]

plt.figure(figsize=(10, 4))  # Set the figure size
plot_pacf(residual, lags = 40)
plt.ylim(-0.2, 1.2)  # set the range of Y axis 
[Figure: Partial autocorrelation function of the residuals (40 lags)]

In the ACF and PACF plots the correlations fall within the significance bounds after the first few lags, so p = 3 and q = 3 are reasonable choices for the ARIMA model.
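As a cross-check on this reading of the plots, a few candidate (p, d, q) orders can be compared by AIC. This is an illustrative sketch of my own (it can take a while on ~19k observations); the order actually used later is (3, 0, 3).

from statsmodels.tsa.arima.model import ARIMA

# Sketch: compare a small grid of ARIMA orders on the residual series by AIC (lower is better)
candidate_orders = [(1, 0, 1), (2, 0, 2), (3, 0, 3)]
for order in candidate_orders:
    fitted = ARIMA(residual, order=order).fit()
    print(f"ARIMA{order} AIC: {fitted.aic:.1f}")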

3.2 LSTM model implementing

# LSTM with a time-step regression framing for the appliances energy series
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM
from sklearn.metrics import mean_squared_error

For the LSTM model I define a helper function, create_dataset, that converts an array of values into a supervised-learning matrix. A loop builds the input-output pairs: at each step it takes a window of look_back consecutive values from the series as the input features and the value immediately after the window as the target.

# Function to convert an array of values into a dataset matrix, default look back set to 1.
def create_dataset(dataset, look_back=1):
    dataX, dataY = [], []
    for i in range(len(dataset)-look_back-1):
        a = dataset[i:(i+look_back), 0]
        dataX.append(a)
        dataY.append(dataset[i + look_back, 0])
    return np.array(dataX), np.array(dataY)
# Fix random seed for reproducibility
tf.random.set_seed(7)
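To make the windowing concrete, here is a tiny worked example of what create_dataset returns; the toy values are purely illustrative and not part of the original notebook.

# Sketch: sliding-window framing on a toy series with look_back=2
toy = np.array([[10.0], [20.0], [30.0], [40.0], [50.0]])
toy_X, toy_y = create_dataset(toy, look_back=2)
print(toy_X)  # [[10. 20.]
              #  [20. 30.]]
print(toy_y)  # [30. 40.]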
# Reshape into X=t and Y=t+1
look_back = 3 # explicitly set look back value
trainX, trainY = create_dataset(train, look_back)
testX, testY = create_dataset(test, look_back)

# Reshape input to be [samples, time steps, features]
trainX = np.reshape(trainX, (trainX.shape[0], trainX.shape[1], 1))
testX = np.reshape(testX, (testX.shape[0], testX.shape[1], 1))

3.3 Machine Learning Methods

  • Preparing data for machine learning methods
from sklearn.ensemble import GradientBoostingRegressor,RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
from sklearn.model_selection import train_test_split


x = data.drop(['Appliances','date','Windspeed','Visibility','Tdewpoint','rv1','rv2'],axis=1)
y = data['Appliances'] # set target feature


# Split data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

  • import random forest regressor and gradient boosting regressor
# 2 models
rf = RandomForestRegressor(random_state=42)
gb = GradientBoostingRegressor(random_state=42)

4. Train prediction models

4.1 ARIMA model training

from statsmodels.tsa.arima.model import ARIMA

# Define the ARIMA model with (p, d, q) = (3, 0, 3); the series is sampled every 10 minutes
model_AR = ARIMA(residual, order=(3, 0, 3), freq='10T')

# Fit the model
results_AR = model_AR.fit()
c:\Users\sulij\anaconda3\envs\mbd\Lib\site-packages\statsmodels\tsa\base\tsa_model.py:473: ValueWarning: No frequency information was provided, so inferred frequency 10min will be used.
  self._init_dates(dates, freq)

4.2 LSTM model training

In an earlier experiment I trained with epochs=10 and found that the loss was already close to its minimum from around the third epoch. To keep repeated runs fast, I therefore set epochs to 3 here.
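As an alternative to hard-coding the epoch count, Keras' EarlyStopping callback can halt training once the loss stops improving. The snippet below is a hedged sketch of that variant, not what was actually run here; the fit call is left commented out because the model is only defined in the next cell.

from tensorflow.keras.callbacks import EarlyStopping

# Sketch: stop automatically when the training loss plateaus instead of fixing epochs=3
early_stop = EarlyStopping(monitor='loss', patience=2, restore_best_weights=True)
# model.fit(trainX, trainY, epochs=20, batch_size=1, verbose=2, callbacks=[early_stop])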

# create and fit the LSTM network
model = Sequential()
model.add(LSTM(4, input_shape=(look_back, 1)))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')

# from previous tests, the loss is already close to its minimum after epoch 3, so 3 epochs are used here
model.fit(trainX, trainY, epochs=3, batch_size=1, verbose=2)
Epoch 1/3
15784/15784 - 19s - 1ms/step - loss: 0.0048
Epoch 2/3
15784/15784 - 17s - 1ms/step - loss: 0.0041
Epoch 3/3
15784/15784 - 18s - 1ms/step - loss: 0.0040





<keras.src.callbacks.history.History at 0x224d0497c10>

4.3 Random Forest model training

from sklearn.model_selection import RandomizedSearchCV

# Define the hyperparameter grid
params = {'n_estimators': [100,200,500],'max_depth':[3,5,8]}

# Reduce the number of folds for cross-validation
folds = 3

# Use RandomizedSearchCV instead of GridSearchCV
model_cv_1 = RandomizedSearchCV(estimator=rf, 
                                param_distributions=params, 
                                scoring='neg_mean_absolute_error', 
                                cv=folds, 
                                return_train_score=True,
                                n_iter=10,  # Number of parameter settings sampled
                                verbose=1, 
                                n_jobs=-1)

# Fit the model
model_cv_1.fit(x_train, y_train)
Fitting 3 folds for each of 9 candidates, totalling 27 fits

4.4 Gradient Boosting Regressor training

# GradientBoostingRegressor
params = {'max_depth': [1,2,3]}

# cross validation
folds = 3
model_cv_2 = GridSearchCV(estimator = gb, 
                        param_grid = params, 
                        scoring= 'neg_mean_absolute_error', 
                        cv = folds, 
                        return_train_score=True,
                        verbose = 1)            

model_cv_2.fit(x_train, y_train)
Fitting 3 folds for each of 3 candidates, totalling 9 fits

5. Test prediction models and show results

5.1 ARIMA model predictions

  • Residual predictions
# ARIMA
plt.figure(figsize=(10, 6))  # Set the figure size

# Draw the outcome
plt.plot(residual, label='Actual')
plt.plot(results_AR.fittedvalues, color='red', label='Predicted')
plt.xlabel("Date")
plt.ylabel("Residual")
plt.legend(loc="best")

<matplotlib.legend.Legend at 0x224df779990>

[Figure: ARIMA fitted values vs. actual residual series]

  • Final predictions

The autoregressive (AR) component relies on past observations to predict future values, so the start and end times of the prediction window have been chosen carefully.

# We choose a later period of time to make predictions.
start_time = "2016-01-15 17:00:00"
end_time = "2016-05-27 18:00:00"
predictions = results_AR.predict(start=start_time, end=end_time, dynamic=False, freq='10T')
predictions
2016-01-15 17:00:00   -69.068783
2016-01-15 17:10:00   -44.970132
2016-01-15 17:20:00   -37.958415
2016-01-15 17:30:00   -59.941219
2016-01-15 17:40:00   -47.876292
                         ...    
2016-05-27 17:20:00    -0.155686
2016-05-27 17:30:00    -0.155686
2016-05-27 17:40:00    -0.155686
2016-05-27 17:50:00    -0.155686
2016-05-27 18:00:00    -0.155686
Freq: 10min, Name: predicted_mean, Length: 19159, dtype: float64
# Add back the trend and seasonal components; otherwise the predictions (residuals only) would sit well below the original data
final_predictions = predictions + trend + seasonal
final_predictions.shape
(19735,)
plt.figure(figsize=(10, 6))  # Set the figure size

# Plot original data
plt.plot(data['date'], data_arima['Appliances'], label='Original Data')

# Plot predictions
plt.plot(data['date'], final_predictions, color='red', label='Predictions')

# Set labels and title
plt.xlabel('Date')
plt.ylabel('Appliances (Wh)')
plt.title('ARIMA Model Predictions')

# Add legend
plt.legend()
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability
plt.grid(True)  # Add gridlines

# Show plot
plt.show()
[Figure: ARIMA Model Predictions]

5.2 LSTM predictions

# make predictions
trainPredict = model.predict(trainX)
testPredict = model.predict(testX)
# invert predictions
trainPredict = scaler.inverse_transform(trainPredict)
trainY = scaler.inverse_transform([trainY])
testPredict = scaler.inverse_transform(testPredict)
testY = scaler.inverse_transform([testY])

# shift train predictions for plotting
trainPredictPlot = np.empty_like(dataset)
trainPredictPlot[:, :] = np.nan
trainPredictPlot[look_back:len(trainPredict)+look_back, :] = trainPredict
# shift test predictions for plotting
testPredictPlot = np.empty_like(dataset)
testPredictPlot[:, :] = np.nan
testPredictPlot[len(trainPredict)+(look_back*2)+1:len(dataset)-1, :] = testPredict
494/494 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step
124/124 ━━━━━━━━━━━━━━━━━━━━ 0s 886us/step
# Plot baseline and predictions
plt.figure(figsize=(10, 6))  # Set the figure size
plt.plot(data['date'], data['Appliances'], color='#1f77b4',  linestyle='-', label='Original Data')  # Plot with markers
plt.plot(data['date'], trainPredictPlot, color='red',  linestyle='-', label='Training Predictions')  # Plot with markers
plt.plot(data['date'], testPredictPlot, color='green',  linestyle='-', label='Test Predictions')  # Plot with markers

plt.title('LSTM Model Predictions')
plt.xlabel('Date')
plt.ylabel('Appliances (Wh)')
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability
plt.grid(True)  # Add gridlines
plt.tight_layout()  # Adjust layout to prevent clipping of labels
plt.legend()  # Add legend
plt.show()

[Figure: LSTM Model Predictions]

5.3 Random Forest predictions

#final RandomForestRegressor

rf = RandomForestRegressor(max_depth=8, n_estimators=500,random_state=42)
rf.fit(x_train, y_train)
RandomForestRegressor(max_depth=8, n_estimators=500, random_state=42)

rf_test_pred = rf.predict(x_test)

5.4 Gradient Boosting Regressor predictions

#final GradientBoostingRegressor

gb = GradientBoostingRegressor(max_depth=5,random_state=42)
gb.fit(x_train, y_train)
GradientBoostingRegressor(max_depth=5, random_state=42)

gb_test_pred = gb.predict(x_test)

6. Compare the results from all candidate models, choose the best model, justify your choice and discuss the results


  • ARIMA
data_subset = data_arima['Appliances'][(data_arima.index >= start_time) & (data_arima.index <= end_time)]

# Calculate MAE, RMSE, and MAPE for the combined dataset
mae_ar = np.mean(np.abs(data_subset- final_predictions))
mse_ar = np.mean((data_subset- final_predictions) ** 2)
rmse_ar = np.sqrt(mse_ar)
mape_ar = np.mean(np.abs((data_subset- final_predictions) / data_subset)) * 100

print("Mean Absolute Error (MAE) for ARIMA:", mae_ar)
print("Root Mean Squared Error (RMSE) for ARIMA:", rmse_ar)
print("Mean Absolute Percentage Error (MAPE) for ARIMA:", mape_ar)
Mean Absolute Error (MAE) for ARIMA: 30.430244505967522
Root Mean Squared Error (RMSE) for ARIMA: 62.33467043574461
Mean Absolute Percentage Error (MAPE) for ARIMA: 31.594001482855322
  • LSTM
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Combine trainPredict and testPredict into a single dataset
combined_predictions = np.concatenate((trainPredict[:,0], testPredict[:,0]))

# Calculate MAE, RMSE, and MAPE for the combined dataset
mae_lstm = mean_absolute_error(np.concatenate((trainY[0], testY[0])), combined_predictions)
rmse_lstm = np.sqrt(mean_squared_error(np.concatenate((trainY[0], testY[0])), combined_predictions))
mape_lstm = np.mean(np.abs((np.concatenate((trainY[0], testY[0])) - combined_predictions) / np.concatenate((trainY[0], testY[0]))) * 100)

print("Mean Absolute Error (MAE) for LSTM:", mae_lstm)
print("Root Mean Squared Error (RMSE) for LSTM:", rmse_lstm)
print("Mean Absolute Percentage Error (MAPE) for LSTM:", mape_lstm)
Mean Absolute Error (MAE) for LSTM: 29.576427576887188
Root Mean Squared Error (RMSE) for LSTM: 65.57357656216654
Mean Absolute Percentage Error (MAPE) for LSTM: 29.764518825012665
  • Random Forest
# Calculate MAE, RMSE, and MAPE for the combined dataset
mae_rf = metrics.mean_absolute_error(y_test, rf_test_pred)
rmse_rf=np.sqrt(metrics.mean_squared_error(y_test, rf_test_pred))
mape_rf = np.mean(np.abs((y_test - rf_test_pred) / y_test)) * 100

print("Mean Absolute Error (MAE) for RF:", mae_rf)
print("Root Mean Squared Error (RMSE) for RF:", rmse_rf)
print("Mean Absolute Percentage Error (MAPE) for RF:", mape_rf)
Mean Absolute Error (MAE) for RF: 44.99424619020264
Root Mean Squared Error (RMSE) for RF: 83.49559904823651
Mean Absolute Percentage Error (MAPE) for RF: 52.11667094759201
  • Gradient Boosting
# Calculate MAE, RMSE, and MAPE for the combined dataset
mae_gb = metrics.mean_absolute_error(y_test, gb_test_pred)
rmse_gb=np.sqrt(metrics.mean_squared_error(y_test, gb_test_pred))
mape_gb = np.mean(np.abs((y_test - gb_test_pred) / y_test)) * 100

print("Mean Absolute Error (MAE) for GB:", mae_gb)
print("Root Mean Squared Error (RMSE) for GB:", rmse_gb)
print("Mean Absolute Percentage Error (MAPE) for GB:", mape_gb)
Mean Absolute Error (MAE) for GB: 41.618740815093545
Root Mean Squared Error (RMSE) for GB: 78.7710393221513
Mean Absolute Percentage Error (MAPE) for GB: 46.484278532018166
import numpy as np
import matplotlib.pyplot as plt

models = ['ARIMA', 'LSTM', 'RF', 'GB']

mae_scores = [mae_ar, mae_lstm, mae_rf, mae_gb]
rmse_scores = [rmse_ar, rmse_lstm, rmse_rf, rmse_gb]
mape_scores = [mape_ar, mape_lstm, mape_rf, mape_gb]

bar_width = 0.25

index = np.arange(len(models))

plt.bar(index - bar_width, mae_scores, bar_width, label='MAE', color='red')

plt.bar(index, rmse_scores, bar_width, label='RMSE', color=color)

plt.bar(index + bar_width, mape_scores, bar_width, label='MAPE', color='skyblue')

for i in range(len(models)):
    plt.text(index[i] - bar_width, mae_scores[i] + 0.01, str(round(mae_scores[i], 2)), ha='center', va='bottom')
    plt.text(index[i], rmse_scores[i] + 0.01, str(round(rmse_scores[i], 2)), ha='center', va='bottom')
    plt.text(index[i] + bar_width, mape_scores[i] + 0.01, str(round(mape_scores[i], 2)), ha='center', va='bottom')

plt.xlabel('Models')
plt.ylabel('Scores')
plt.title('Comparison of Model Scores')

plt.xticks(index, models)

plt.legend()

plt.show()
[Figure: Comparison of Model Scores (MAE, RMSE, MAPE)]

From the chart we can see that the scores of ARIMA and LSTM are very close, placing them in the top tier, with Gradient Boosting and Random Forest trailing behind.

The reflection from this is that model selection and parameter tuning form a complex process. In this assignment, with data collected every 10 minutes over more than four months, both ARIMA and LSTM are viable options for modelling and prediction. Since we have prior knowledge of the series' structure (trend and seasonality), ARIMA is a sensible choice: even with a simple parameter selection process it reached relatively good performance.

Traditional machine learning methods like Random Forest and Gradient Boosting can be applied to this kind of problem, but they are usually harder to adapt to time series than ARIMA and LSTM, and here their results were not as strong.
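One common way to make the tree-based models more time-series aware, which was not done above, is to feed them explicit lag features of the target. The following is a hedged sketch of that idea (the names lagged, X_lag, y_lag and rf_lag are illustrative only), using a chronological train/test split rather than a shuffled one:

# Sketch: engineer lag features so a Random Forest can "see" recent history
lagged = data[['date', 'Appliances']].copy()
for lag in (1, 2, 3, 6):                                   # 10, 20, 30 and 60 minutes back
    lagged[f'lag_{lag}'] = lagged['Appliances'].shift(lag)
lagged = lagged.dropna()

X_lag = lagged[[c for c in lagged.columns if c.startswith('lag_')]]
y_lag = lagged['Appliances']

split = int(len(lagged) * 0.8)                             # chronological split, no shuffling
rf_lag = RandomForestRegressor(max_depth=8, n_estimators=100, random_state=42)
rf_lag.fit(X_lag.iloc[:split], y_lag.iloc[:split])
print(rf_lag.score(X_lag.iloc[split:], y_lag.iloc[split:]))  # R^2 on the held-out tail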

7. Reflect on what you have learned by completing this assignment

It was my first foray into this field, and the project provided me with valuable insights into the characteristics of time series data and introduced me to relevant analytical methods.

After EDA, I identified the seasonal character of the dataset and decomposed the main series into trend, seasonal, and residual components with the statsmodels 'tsa' tools for the ARIMA implementation; this step matters because the ARIMA model here is fitted on the residuals. I then selected the p, d, and q parameters by inspecting the ACF and PACF. With the LSTM model I learned the importance of scaling the data, and the network fitted its weights over successive epochs, while for Random Forest and Gradient Boosting scikit-learn searched for the best hyperparameters within the specified grids. Comparing the models, I found that although all four achieved broadly similar outcomes, the choice of model can still significantly affect prediction accuracy.

Throughout the entire process, I dedicated over 20 hours to studying relevant concepts online, reviewing provided materials, testing models, and troubleshooting issues. This endeavor deepened my understanding of time series forecasting. However, I am acutely aware that refining models for enhanced efficiency and accuracy is an ongoing journey. I view this experience as a promising starting point for my future studies.

8. References

https://www.kaggle.com/code/gaganmaahi224/appliances-energy-time-series-analysis#ARMA-Models
(for data preprocessing)

https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/
(for LSTM model implementation)

https://machinelearningmastery.com/decompose-time-series-data-trend-seasonality/
(for understanding seasonal decomposition)

https://www.investopedia.com/terms/a/autoregressive-integrated-moving-average-arima.asp#:~:text=An%20autoregressive%20integrated%20moving%20average%2C%20or%20ARIMA%2C%20is%20a%20statistical,values%20based%20on%20past%20values
