Linear regression is an essential statistical method for modeling the relationship between variables. In this article we will study the concept of simple linear regression and implement it in Python.
1. Introduction to Simple Linear Regression :
Simple linear regression is a statistical method that helps us find the relationship between two variables. It explores the connection between a dependent variable (the one we are predicting) and an independent variable (the one used to predict).
The equation for simple linear regression is as follows:
y = b0 + b1 * x + e
Where,
- y is the dependent variable.
- x is the independent variable.
- b0 is the intercept (the value of y when x is zero).
- b1 is the slope.
- e is the error.
The main goal of simple linear regression is to find the values of b0 and b1 that best fit the data.
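As a rough illustration of what "best fit" means, the sketch below estimates b0 and b1 with the ordinary least squares formulas (slope = covariance of x and y divided by variance of x, intercept = mean of y minus slope times mean of x). The data points are made up for this example and the variable names are assumptions, not part of the student dataset used later.

```python
import numpy as np

# Made-up sample points (hypothetical, for illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

x_mean, y_mean = x.mean(), y.mean()

# Ordinary least squares estimates for a single predictor
b1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)  # slope
b0 = y_mean - b1 * x_mean                                             # intercept

print("b0 (intercept):", b0)
print("b1 (slope):", b1)

# np.polyfit with degree 1 fits the same straight line; it returns [slope, intercept]
slope_check, intercept_check = np.polyfit(x, y, 1)
```

The values from the formulas and from np.polyfit should agree up to floating point rounding, which is a quick way to check the arithmetic.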
For more theoretical knowledge refer this
2. Data preprocessing :
Data preprocessing is an essential step in any data analysis task. It includes data cleaning, transforming and preparing raw data for further analysis.
We discussed it in detail in a previous blog post. Please refer to:
7 Essential Techniques for Data Preprocessing Using Python: A Guide for Data Scientists.
3. Implementing Simple Linear Regression using python :
Step 1 : Importing Necessary Libraries
The first step is to import the libraries needed for the analysis.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
Step 2 : Loading and preprocessing the data
For this model we will use a dataset of students that includes their study hours and their exam scores. The data is stored in a CSV file named student_scores.csv.
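from scipy import stats
# Load the data
df = pd.read_csv("student_scores.csv")
# Check for missing values in each column
missing_values = df.isnull().sum()
print(missing_values)
# Handle missing values (here they are replaced with 0)
df.fillna(0, inplace=True)
# Inspect outliers with a box plot
sns.boxplot(x=df['Hours'])
plt.show()
# Keep only the rows whose Hours value is within 3 standard deviations of the mean
df = df[np.abs(stats.zscore(df['Hours'])) < 3]
# Encode categorical variables, if the data has any
# (the dataset described above has only hours and scores, so this is skipped for it;
# encoded columns are appended at the end, so adjust the column selection in Step 3 if you use it)
if 'Category' in df.columns:
    df = pd.get_dummies(df, columns=['Category'])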
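To see what the z-score filter does, here is a small sketch on made-up data. The values and the "Scores" column name are assumptions for illustration only. Note that in very small samples no point can reach a z-score of 3, so the example uses 21 rows, one of which is clearly unrealistic.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical study hours: 20 plausible values plus one obvious outlier (60 hours)
hours_values = [2.5, 5.1, 3.2, 8.5, 3.5, 1.5, 9.2, 5.5, 8.3, 2.7,
                7.7, 5.9, 4.5, 3.3, 1.1, 8.9, 2.5, 1.9, 6.1, 7.4, 60.0]
demo_df = pd.DataFrame({"Hours": hours_values})

# z-score: how many standard deviations each value lies from the mean
z_scores = np.abs(stats.zscore(demo_df["Hours"]))

# Keep only the rows within 3 standard deviations
filtered_df = demo_df[z_scores < 3]

print("rows before:", len(demo_df))
print("rows after:", len(filtered_df))
```

Only the 60-hour row is dropped here; the threshold of 3 is a common convention, and a different cutoff may suit other data better.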
Step 3 : Creating dependent and independent variables
x = df.iloc[:, :-1].values # independent variable
y = df.iloc[:, -1].values # dependent variable
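The slicing above matters because scikit-learn expects the features as a 2-D array (rows by columns) and the target as a 1-D array. A minimal sketch with a tiny made-up frame (the values and the "Scores" column name are assumptions) shows the resulting shapes:

```python
import pandas as pd

# Tiny hypothetical frame with the same layout: features first, target last
shape_demo = pd.DataFrame({"Hours": [2.5, 5.1, 3.2], "Scores": [21, 47, 27]})

x_shape_demo = shape_demo.iloc[:, :-1].values  # all columns except the last
y_shape_demo = shape_demo.iloc[:, -1].values   # only the last column

print(x_shape_demo.shape)  # (3, 1): 2-D, one column per feature
print(y_shape_demo.shape)  # (3,): 1-D target
```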
Step 4 : Splitting dataset in training and test set
Splitting the dataset helps us evaluate the model: we train it on the training set and then check how well it performs on the test set, which it has not seen.
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x, y, test_size=0.2, random_state=21)
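To make the effect of test_size and random_state concrete, the sketch below splits 25 made-up rows (an assumption; the real dataset may have a different size). With test_size=0.2, 5 rows go to the test set and 20 to the training set, and reusing the same random_state reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 25 hypothetical rows with a single feature
x_split_demo = np.arange(25, dtype=float).reshape(-1, 1)
y_split_demo = 10.0 * x_split_demo.ravel()

x_tr, x_te, y_tr, y_te = train_test_split(
    x_split_demo, y_split_demo, test_size=0.2, random_state=21
)
print(x_tr.shape, x_te.shape)

# The same random_state gives the same rows in the test set again
x_tr2, x_te2, y_tr2, y_te2 = train_test_split(
    x_split_demo, y_split_demo, test_size=0.2, random_state=21
)
print(np.array_equal(x_te, x_te2))
```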
Step 5 : Fitting the Regression model on data
The sklearn library provides a class named LinearRegression, which is used to fit the model to the data.
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train, y_train)
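After fitting, the estimated b0 and b1 are available as the intercept_ and coef_ attributes of the model. The sketch below uses synthetic, noise-free data generated from score = 2 + 3 * hours (an assumption chosen so the expected result is known) and checks that predict() gives the same value as computing b0 + b1 * x by hand.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data (hypothetical): score = 2 + 3 * hours
hours_demo = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # 2-D feature array
scores_demo = 2.0 + 3.0 * hours_demo.ravel()                # 1-D target array

model = LinearRegression()
model.fit(hours_demo, scores_demo)

fitted_b0 = model.intercept_  # estimated intercept
fitted_b1 = model.coef_[0]    # estimated slope of the single feature
print("intercept:", fitted_b0, "slope:", fitted_b1)

# predict() applies the fitted line, so it matches the manual calculation
new_hours = 9.25
manual_prediction = fitted_b0 + fitted_b1 * new_hours
model_prediction = model.predict([[new_hours]])[0]
print(manual_prediction, model_prediction)
```

On real data with noise, the recovered values will not match any "true" line exactly, but the two attributes are read the same way.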
Step 6 : Predicting
After fitting the model, we can use it to make predictions on new data. Let's test the model by predicting the exam score for a student who studies for 9.25 hours.
hours = 9.25
score = regressor.predict([[hours]])
print('Predicted Score:', score[0])
Output :
Predicted Score: 92.90985477015731
Step 7 : Model Evaluation
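Two metrics are important for model evaluation: mean squared error (MSE) and R-squared. MSE is the average squared difference between the actual and predicted values (lower is better), and R-squared is the share of the variance in the dependent variable that the model explains (closer to 1 is better).
Below is the code to calculate both: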
from sklearn.metrics import mean_squared_error, r2_score
# making predictions on the test set
y_pred = regressor.predict(x_test)
# calculating mean squared error
mse = mean_squared_error(y_test,y_pred)
print('mean_squared_error : ',mse)
# calculating r-squared value
r2 = r2_score(y_test,y_pred)
print("R-Squared value : ",r2)
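To show what these two functions compute, the sketch below works out both metrics by hand with NumPy on a few made-up values (hypothetical, not from the student dataset) and compares them with the scikit-learn results.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted values (for illustration only)
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.5, 5.5, 7.0, 9.5])

# Mean squared error: average of the squared residuals
manual_mse = np.mean((y_true_demo - y_pred_demo) ** 2)

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true_demo - y_pred_demo) ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
manual_r2 = 1 - ss_res / ss_tot

print("manual MSE:", manual_mse, "sklearn MSE:", mean_squared_error(y_true_demo, y_pred_demo))
print("manual R2:", manual_r2, "sklearn R2:", r2_score(y_true_demo, y_pred_demo))
```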
Conclusion :
In this article we have studied the concept of simple linear regression and implemented it in Python.
Remember, linear regression is just one of many techniques available in regression analysis. Keep exploring and expanding your knowledge to unlock the full potential of regression analysis.