AngelaMunyao

Building a Machine Learning model using Multiple Linear regression

I spent a couple of hours early this morning writing this article for all Machine Learning enthusiasts, especially those at the beginner to intermediate level.

One essential skill for ML engineers is understanding regression and how to apply it to data sets of any size, from small to very large.

This article walks through a practical example of building a Machine Learning model using Multiple Linear Regression.

For this exercise, make sure you have Anaconda installed, and from there, open Jupyter Notebook.
Download the Combined Cycle Power Plant data set from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/combined+cycle+power+plant

Import the libraries below that we will be using:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pylab as pl

Enable inline plotting with Matplotlib:

%matplotlib inline

Import the data: unzip the downloaded folder and make sure the data file is in the same directory as your notebook:

data_df=pd.read_excel("Folds5x2_pp.xlsx")

You can now view your data with the command below (by default, it displays the first 5 rows):

data_df.head()

To recap the variables in the data set:
Temperature (AT) in the range 1.81°C to 37.11°C
Exhaust Vacuum (V) in the range 25.36 to 81.56 cm Hg
Ambient Pressure (AP) in the range 992.89 to 1033.30 millibar
Relative Humidity (RH) in the range 25.56% to 100.16%
Net hourly electrical energy output (PE) in the range 420.26 to 495.76 MW
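
A quick way to check ranges like these against loaded data is to look at the per-column minimum and maximum. The sketch below uses a tiny hypothetical frame (`sample_df`, with invented values) in place of `data_df`, so it runs without the Excel file; on the real data you would call the same method on `data_df`.

```python
import pandas as pd

# Hypothetical stand-in for data_df: three invented rows with the same column names.
sample_df = pd.DataFrame({
    "AT": [10.0, 20.0, 30.0],
    "V": [40.0, 50.0, 60.0],
    "AP": [1000.0, 1010.0, 1020.0],
    "RH": [50.0, 60.0, 70.0],
    "PE": [480.0, 460.0, 440.0],
})

# agg(["min", "max"]) returns one row per statistic and one column per variable.
ranges = sample_df.agg(["min", "max"])
print(ranges)
```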

Your dependent variable is PE.

Let's define x and y, x being the independent variables and y being the dependent variable.

To capture the independent variables, we use the expression 'x=data_df.drop(['PE'], axis=1).values'.
The drop function excludes the dependent variable PE, 'axis=1' tells it to drop a column rather than a row, and '.values' returns the remaining data as a NumPy array.

To capture the dependent variable PE, we use 'y=data_df['PE'].values'.

x=data_df.drop(['PE'], axis=1).values
y=data_df['PE'].values
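
To see what `drop` and `.values` return, here is a minimal sketch on a hypothetical three-row frame (`toy_df`, with invented numbers) rather than the real data. The features come back as a 2D NumPy array with one column per independent variable, and the target as a 1D array.

```python
import pandas as pd

# Hypothetical frame with the same column names as the power plant data.
toy_df = pd.DataFrame({
    "AT": [12.0, 22.0, 32.0],
    "V": [42.0, 52.0, 62.0],
    "AP": [1005.0, 1015.0, 1025.0],
    "RH": [55.0, 65.0, 75.0],
    "PE": [475.0, 455.0, 435.0],
})

x_toy = toy_df.drop(["PE"], axis=1).values  # every column except PE
y_toy = toy_df["PE"].values                 # only the PE column

print(x_toy.shape)  # (3, 4): 3 rows, 4 independent variables
print(y_toy.shape)  # (3,): one PE value per row
```

Note that `drop` returns a new frame, so `toy_df` itself still contains the PE column afterwards.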

Confirm the x and y values:

print(x)

print(y)

Split the data set into training and test set:

We use the 'train_test_split' function from the scikit-learn library.
Import the train_test_split function:

from sklearn.model_selection import train_test_split

Divide your data into x_train, x_test, y_train, y_test:

x_train,x_test,y_train,y_test=train_test_split(x,y)
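
By default, `train_test_split` shuffles the rows and holds out 25% of them for testing, so the split (and every score computed after it) changes from run to run. Here is a minimal sketch, using randomly generated stand-in arrays instead of the power plant data, of how the `test_size` and `random_state` parameters make the split explicit and repeatable. The values 0.2 and 42 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 100 rows with 4 features, generated only for this illustration.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(100, 4))
y_demo = rng.normal(size=100)

# Hold out 20% of the rows; random_state fixes the shuffle so reruns match.
xd_train, xd_test, yd_train, yd_test = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=42
)

print(xd_train.shape, xd_test.shape)  # (80, 4) (20, 4)
```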

Voila! Your data is split into a training set and a test set.
Next, train the model using the training set. We will make use of linear regression.

from sklearn.linear_model import LinearRegression
model=LinearRegression()
model.fit(x_train,y_train)
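
Multiple linear regression fits one coefficient per independent variable plus an intercept, so a prediction is the intercept plus the weighted sum of AT, V, AP and RH. After fitting, scikit-learn exposes these as `coef_` and `intercept_`. A small sketch on synthetic data (the weights and intercept below are invented, not values from the power plant data) shows the fitted model recovering them:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data built from known, made-up weights so the result can be checked.
rng = np.random.default_rng(0)
x_syn = rng.normal(size=(200, 4))
true_weights = np.array([-2.0, -0.3, 0.1, -0.2])
true_intercept = 450.0
y_syn = x_syn @ true_weights + true_intercept  # no noise added

syn_model = LinearRegression()
syn_model.fit(x_syn, y_syn)

print(syn_model.coef_)       # one fitted weight per column of x_syn
print(syn_model.intercept_)  # fitted constant term
```

On the notebook's model, the same attributes are `model.coef_` and `model.intercept_`.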

After training the model, predict the test set results.

y_pred=model.predict(x_test)

Let's print the prediction results

print(y_pred)

The predictions above are generated for every row in the test set, one PE value for each set of independent variables in x_test.

We can also predict PE for a single row of x values (AT, V, AP, RH), as below.
The example uses the values from the first row of x.

model.predict([[14.96,41.76,1024.07,73.17]])
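
`predict` expects a 2D array with one row per sample, which is why the single row above is wrapped in double brackets. A minimal sketch, using a hypothetical model fitted on invented data, shows the equivalent `reshape(1, -1)` route when the row starts out as a 1D array:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny made-up training set, only so there is a fitted model to call.
rng = np.random.default_rng(1)
x_small = rng.normal(size=(50, 4))
y_small = x_small.sum(axis=1)
small_model = LinearRegression().fit(x_small, y_small)

row = np.array([14.96, 41.76, 1024.07, 73.17])         # 1D array: shape (4,)
single_pred = small_model.predict(row.reshape(1, -1))  # reshaped to (1, 4)

print(single_pred.shape)  # (1,): one prediction for one row
```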

Let's check how accurate our model is.
We need to import the function 'r2_score'.

from sklearn.metrics import r2_score
r2_score(y_test,y_pred)

The R² score of our model is about 0.92, meaning it explains roughly 92% of the variance in PE on the test set. :)
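
R² compares the model's squared errors with those of a baseline that always predicts the mean of the actual values: it is 1 minus the ratio of the two. A short sketch with invented numbers (not results from this model) computes it by hand and checks it against `r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up actual and predicted values, for illustration only.
actual = np.array([450.0, 460.0, 470.0, 480.0])
predicted = np.array([452.0, 458.0, 471.0, 479.0])

ss_res = np.sum((actual - predicted) ** 2)      # squared error of the model
ss_tot = np.sum((actual - actual.mean()) ** 2)  # squared error of the mean baseline
manual_r2 = 1 - ss_res / ss_tot

print(manual_r2)
print(r2_score(actual, predicted))
```

A score of 1.0 means perfect predictions, and a score of 0 means the model does no better than always predicting the mean.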

Next, let's visualize the predicted results in a scatter plot.
We will use Matplotlib, which we already imported.

# Use plt.figure to enlarge the plot so it doesn't appear too small
plt.figure(figsize=(15, 10))
plt.scatter(y_test,y_pred)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Actual PE Values vs. Predicted')
plt.show()
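
A diagonal reference line can make this kind of plot easier to read: points on the line are perfect predictions, and the vertical distance from it is the error. Here is a minimal sketch with randomly generated stand-in values. On the notebook data you would pass `y_test` and `y_pred` instead.

```python
import matplotlib.pyplot as plt
import numpy as np

# Stand-in values: "predictions" are the actual values plus random noise.
rng = np.random.default_rng(0)
actual_demo = rng.uniform(420, 495, size=100)
pred_demo = actual_demo + rng.normal(scale=4.0, size=100)

fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(actual_demo, pred_demo, alpha=0.6)

# Reference line y = x across the observed range of actual values.
low, high = actual_demo.min(), actual_demo.max()
ax.plot([low, high], [low, high], color="red", linestyle="--")

ax.set_xlabel("Actual")
ax.set_ylabel("Predicted")
ax.set_title("Actual PE Values vs. Predicted")
```

With `%matplotlib inline`, the figure renders when the cell finishes; in a plain script, add `plt.show()` at the end.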

From our dataset, let's also create a comparison of the test values and the predicted values.

Use pandas, already imported as 'pd', to put the values into a data frame.

pred_comparison=pd.DataFrame({'Actual Value': y_test,'Predicted Value':y_pred, 'Difference':y_test-y_pred})
pred_comparison

The notebook output shows the first 5 and the last 5 rows of the comparison.

To view the first 40 rows for more clarity, we use:

pred_comparison[0:40]
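
The differences between actual and predicted values can also be condensed into a single error figure. A minimal sketch, using a small hypothetical frame (`demo_comparison`, with invented values) rather than the notebook's results, computes the mean absolute error and the root mean squared error, both of which are in the same unit as PE (MW):

```python
import numpy as np
import pandas as pd

# Made-up actual and predicted values, for illustration only.
demo_comparison = pd.DataFrame({
    "Actual Value": [450.0, 460.0, 470.0],
    "Predicted Value": [452.0, 458.0, 470.0],
})
demo_comparison["Difference"] = (
    demo_comparison["Actual Value"] - demo_comparison["Predicted Value"]
)

mae = demo_comparison["Difference"].abs().mean()             # mean absolute error
rmse = np.sqrt((demo_comparison["Difference"] ** 2).mean())  # root mean squared error

print(mae, rmse)
```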

Awesomeee, that's our model right there! Look into more ways to improve the model. Bye!
