

#P3 - Linear Regression

Ashutosh Sahu

Linear regression attempts to find the relationship between two variables by fitting a linear equation to observed data. One variable is considered to be an independent variable, and the other is considered to be a dependent variable.
The dependent variable is also known as the criterion variable and the independent variable is also known as the predictor variable.

Our task is to find how the dependent variable (Y) can be predicted from the independent variable (X). To do this, we assume that the points (x, y) lie on a straight line, which means there is a linear relationship between them.

But how can such an assumption give accurate results? Some points will not lie on the line, so there will always be some error in our result; you cannot expect a model to be perfectly accurate. A larger dataset generally gives a more reliable fit.

So now we have to find the line that best satisfies the following conditions:

  • The line should pass through the points (x, y), or
  • The distance between the line and the points should be as small as possible.

Now the question is how to find such a line. No, we don't have to take a sheet of paper and start plotting all the points.
Recall from geometry that the equation of a line is
y = a + bx, where b is the slope (gradient) and a is the y-intercept. If we can find the values of a and b, we can compute y for any given x. In this way we can predict the value of y.

The values of a and b can be calculated from the observed data using the least-squares formulas.
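
As a rough sketch of how a and b can be computed, assuming the usual least-squares estimates b = Sxy / Sxx and a = mean(y) - b * mean(x). The function name fit_line and the sample points are made up for illustration:

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sxy: how x and y vary together; Sxx: how x varies on its own
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx               # slope
    a = mean_y - b * mean_x     # y-intercept
    return a, b

# Made-up points that lie exactly on y = 1 + 2x
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
a, b = fit_line(xs, ys)
print(a, b)
```

Because the sample points lie exactly on a line, the fit recovers that line's intercept and slope; with real, noisy data the line only approximates the points.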

What we have discussed so far is simple linear regression, in which the value of y depends on a single independent variable x.

Multiple Linear Regression

When the dependent variable depends on more than one independent variable, multiple linear regression is used.
Here we have to fit a regression line (strictly, a plane or hyperplane) through a multidimensional space of data points.

The equation of the line is given by
y = b0 + b1.x1 + b2.x2 + ...
where x1, x2, ... are the independent variables, b0 is the y-intercept, and b1, b2, ... are the slopes.
Finding the values of b0, b1, b2, ... in this case is done using matrix algebra.
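
To give a feel for that matrix algebra, here is a minimal sketch with NumPy. It assumes the standard least-squares setup: stack the features in a matrix X with an extra column of ones for b0, then solve for the coefficient vector. The data is made up, and numpy.linalg.lstsq is just one way to solve the problem:

```python
import numpy as np

# Made-up data generated from y = 1 + 2*x1 + 3*x2 (no noise)
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = 1 + 2 * X[:, 0] + 3 * X[:, 1]

# Prepend a column of ones so that b0 is estimated along with the slopes
X1 = np.column_stack([np.ones(len(X)), X])

# Least-squares solution of X1 @ b = y; b holds [b0, b1, b2]
b, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(b)
```

Since the made-up y values contain no noise, the solver recovers the coefficients used to generate them.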

Steps for training a model

Prerequisite - Python, Google Colab or Jupyter

The Environment

You can use Jupyter Notebook along with Anaconda, or simply use Google Colab. If you are using Google Colab, you have to import the file from GitHub or through the google.colab module. Colab has some good accessibility features. Jupyter runs on your local machine, so if you are low on resources you should go for Google Colab.

Data Collection

The first step in training a model is to collect data. I prefer Kaggle or the UCI Machine Learning Repository, which provide various types of datasets. Datasets are mostly available as CSV (comma-separated values) files.

About Dataset

The dataset that we have taken for multiple linear regression is from the UCI Machine Learning Repository.
You can get the CSV files from there.

The dataset contains 9568 data points collected from a Combined Cycle Power Plant over 6 years (2006-2011), when the power plant was set to work with a full load. Features consist of the hourly average ambient variables Temperature (AT), Ambient Pressure (AP), Relative Humidity (RH), and Exhaust Vacuum (V), which are used to predict the net hourly electrical energy output (PE) of the plant.

Attribute Information:

Features consist of hourly average ambient variables:

Temperature (AT) in the range 1.81-37.11 °C
Ambient Pressure (AP) in the range 992.89-1033.30 millibar
Relative Humidity (RH) in the range 25.56-100.16%
Exhaust Vacuum (V) in the range 25.36-81.56 cm Hg
Net hourly electrical energy output (PE) in the range 420.26-495.76 MW

The averages are taken from various sensors located around the plant that record the ambient variables every second. The variables are given without Normalization.
We have to train the model to predict PE.

Importing the dataset

pandas is a Python library that can load data from CSV files, Excel files, lists, dicts, and NumPy arrays into a DataFrame.
We will use it to import our CSV file.
If you are using Colab, first upload your CSV file to Colab; its path will then be available as '/content/file_name.csv'.

import pandas
# enter your CSV file path here
path = r"path\to\csv"
dataframe = pandas.read_csv(path)

dataframe.info() prints a summary of the dataframe (columns, non-null counts, and dtypes), and dataframe.head(5) returns its first 5 rows.
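
Here is a small sketch of both calls. To keep it self-contained it reads a tiny made-up CSV from a string instead of the real file; the column names match the dataset, but the values are invented:

```python
import io
import pandas

# Tiny made-up CSV with the same column names as the dataset
csv_text = """AT,V,AP,RH,PE
15.0,40.0,1020.0,70.0,465.0
25.0,60.0,1015.0,60.0,445.0
10.0,38.0,1012.0,90.0,480.0
"""
dataframe = pandas.read_csv(io.StringIO(csv_text))

dataframe.info()          # prints column names, non-null counts, and dtypes
print(dataframe.head(2))  # first 2 rows
print(dataframe.shape)    # (rows, columns)
```

With the real file you would pass its path to read_csv exactly as in the snippet above.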

Separating Independent and Dependent variables

# x: every column except "PE" (features); y: the "PE" column (target)
x = dataframe.loc[:,dataframe.columns!="PE"].values
y = dataframe.loc[:,"PE"].values
print(x[0])  # first row of features

[ 14.96 41.76 1024.07 73.17]

loc is a property of a dataframe that is used to select rows and columns by label. It also accepts boolean arrays, such as dataframe.columns!="PE" above. A bare : selects all rows or all columns.
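
To see what the boolean selection does, here is a minimal sketch on a made-up two-row dataframe (the values are invented; only the column names follow the dataset):

```python
import pandas

df = pandas.DataFrame({"AT": [15.0, 25.0],
                       "V": [40.0, 60.0],
                       "PE": [465.0, 445.0]})

mask = df.columns != "PE"    # boolean array: True for AT and V, False for PE
x = df.loc[:, mask].values   # all rows, every column except PE
y = df.loc[:, "PE"].values   # all rows, only the PE column
print(mask)
print(x.shape, y.shape)
```

The .values at the end turns the selection into a plain NumPy array, which is what the later steps work with.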

Data Preprocessing

Before training a model on any type of data, the data needs to be preprocessed to make it ready for training. We will cover the preprocessing techniques in the next article.

Our dataset currently doesn't require any preprocessing. Feature scaling is not needed either, because ordinary least squares, which sklearn's LinearRegression uses, gives the same predictions whether or not the features are scaled.

Splitting training and test data

In any supervised learning model, we divide the whole dataset into two parts: a training dataset and a testing dataset. We train the model on the training dataset and then test it on the test dataset. Generally, the training dataset makes up about 70% to 80% of the whole dataset.
Scikit-learn (sklearn) is a Python library that provides many unsupervised and supervised learning algorithms. It is built on NumPy and SciPy.

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size = 0.2, random_state=0)

Here test_size determines the size of the test data; in this case the test data is 20% of the whole dataset. random_state seeds the random shuffling done before the split, so the same value always produces the same split.
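
A quick sketch of what these two parameters do, using 10 made-up samples instead of the real dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(20).reshape(10, 2)   # 10 made-up samples, 2 features each
y = np.arange(10)

# Two splits with the same random_state
a_train, a_test, _, _ = train_test_split(x, y, test_size=0.2, random_state=0)
b_train, b_test, _, _ = train_test_split(x, y, test_size=0.2, random_state=0)

print(a_train.shape, a_test.shape)     # 80% / 20% of the 10 samples
print(np.array_equal(a_test, b_test))  # same seed, same test rows
```

If you change random_state to another number you will most likely get a different split, which is why fixing it makes your results repeatable.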

Fitting the model and predicting the values.

from sklearn.linear_model import LinearRegression
# create the model and fit it on the training data
model = LinearRegression()
model.fit(x_train, y_train)

# predictions
y_pred = model.predict(x_test)
print("actual  |  predicted")
for i in range(0,5):
    print("{:.2f}  |  {:.2f}".format(y_test[i], y_pred[i]))

actual | predicted
431.23 | 431.43
460.01 | 458.56
461.14 | 462.75
445.90 | 448.60
451.29 | 457.87

And here is your trained model.
You can see how closely it predicts the values of "PE".

Calculating R-Square, Intercept, Slopes

R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression.
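
As a sketch of what this measure computes, assuming the usual definition R² = 1 - SS_res / SS_tot (the function name r_squared is made up; the sample values are the five actual/predicted pairs printed earlier):

```python
def r_squared(actual, predicted):
    """R² = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual = [431.23, 460.01, 461.14, 445.90, 451.29]
predicted = [431.43, 458.56, 462.75, 448.60, 457.87]
r2 = r_squared(actual, predicted)
print(r2)
```

A perfect prediction gives 1, and always predicting the mean of the actual values gives 0. Five points are far too few for a meaningful score; they only show the arithmetic.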

r_sq = model.score(x,y)
r_sq_train = model.score(x_train,y_train)
r_sq_test = model.score(x_test,y_test)
print("r_sq : ",r_sq, r_sq_train, r_sq_test)

error = 1 - r_sq

print('intercept :', model.intercept_ )
print('slope :', model.coef_)

r_sq : 0.9286947104407257 0.9277253998587902 0.9325315554761303
intercept : 452.8410371616384
slope : [-1.97313099 -0.23649993 0.06387891 -0.15807019]

model.score() calculates the R-squared value.
In this case, r_sq tells us that the model explains about 92.87% of the variance in PE on the whole dataset, 92.77% on the training set, and 93.25% on the test set. R-squared measures goodness of fit; it is not a classification accuracy.
model.intercept_ returns the intercept value b0.
model.coef_ returns the array of coefficients (slopes) b1, b2, b3, ...
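
To connect these numbers back to the equation y = b0 + b1.x1 + b2.x2 + ..., here is a small sketch that plugs the intercept and slopes printed above into the equation by hand, using the first feature row shown earlier. It should give the same kind of result that model.predict() returns for one row:

```python
import numpy as np

intercept = 452.8410371616384                                          # b0
coef = np.array([-1.97313099, -0.23649993, 0.06387891, -0.15807019])  # b1..b4
row = np.array([14.96, 41.76, 1024.07, 73.17])                         # x1..x4

# y = b0 + b1*x1 + b2*x2 + b3*x3 + b4*x4
pe_hat = intercept + coef @ row
print(round(pe_hat, 2))
```

The coef @ row part is just the sum of each slope multiplied by its feature value.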

Feature Selection

Feature selection is the process of reducing the number of independent variables when creating a predictive model. It is desirable to reduce the number of input variables to both reduce the computational cost of modeling and to improve the performance of the model.

Backward Elimination Method

It is one of the methods used for feature selection.
Steps:

  1. Select a significance level, generally SL = 0.05
  2. Fit the model with all possible predictors
  3. Find the p-values of all predictors
  4. If the highest p-value is greater than SL, remove that predictor, fit the model again, and repeat the process until no remaining p-value is greater than 0.05

There is one thing to take care of.
y = b0 + b1.x1 + b2.x2 + b3.x3 ...
In the above equation, every bn multiplies a feature xn, except the constant b0. The statsmodels package only estimates a coefficient for columns that are present in the feature matrix; it does not add an intercept on its own, so b0 would be dropped from the model. Adding a feature x0 that is always 1 solves the problem, because then b0.x0 = b0. Hence we need to create a feature with value = 1.

import statsmodels.regression.linear_model as sm
import numpy
# add a column of values = 1 as the first column (x0)
be_x = numpy.append(arr = numpy.ones((len(x),1)).astype(int), values = x, axis=1)

# fit OLS on all columns and print the summary, which includes the p-values
x_opt = be_x[:,[0,1,2,3,4]]
ols = sm.OLS(endog = y, exog = x_opt).fit()
print(ols.summary())

In the P > |t| column of the summary, no value is greater than 0.05, so there is no useless feature in our model and nothing needs to be removed.

What's Next

Practice by yourself. Choose a dataset and try to fit a model to it. Remember that we have not yet dealt with factors like categorical variables and null values. They will all be covered in the next article on data preprocessing. Choose your dataset wisely.
