Nitin Kendre

Posted on May 30, 2023

From Data to Prediction : Mastering Simple Linear Regression with python

#python #machinelearning #datascience #productivity

Linear regression is an essential statistical method used to analyze the relationship between two variables. In this article we will study the concept of simple linear regression and implement it in Python.

1. Introduction to Simple Linear Regression :

Simple linear regression is a statistical method that helps us find the relationship between two variables. It explores the connection between a dependent variable (the one we are predicting) and an independent variable (the one used to make the prediction).

The equation for simple linear regression is as follows :

y = b0 + b1 * x + e

Where,

y is the dependent variable.
x is the independent variable.
b0 is the intercept (the value of y when x is zero).
b1 is the slope (the change in y for a one-unit increase in x).
e is the error (the part of y that the line does not explain).

The main goal of simple linear regression is to find the values of b0 and b1 that best fit the data.
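As a minimal sketch of what "best fit" means here, assuming the usual least-squares criterion (minimizing the sum of squared errors), b0 and b1 can be computed directly with NumPy. The function name `fit_simple_ols` and the sample points are made up for illustration and are not part of the article's dataset.

```python
import numpy as np

def fit_simple_ols(x, y):
    """Return (b0, b1) that minimize the sum of squared errors for y = b0 + b1 * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: co-variation of x and y divided by the variation of x
    b1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: chosen so the line passes through the point of means
    b0 = y_mean - b1 * x_mean
    return b0, b1

# Made-up points that lie exactly on the line y = 2 + 3x
x_demo = [1, 2, 3, 4, 5]
y_demo = [5, 8, 11, 14, 17]
b0, b1 = fit_simple_ols(x_demo, y_demo)
print("intercept:", b0, "slope:", b1)
```

Because the sample points sit exactly on a line, the recovered intercept and slope match that line (2 and 3). With real, noisy data the same formulas give the line with the smallest total squared error.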

For more theoretical background, refer to this

2. Data preprocessing :

Data preprocessing is an essential step in any data analysis task. It includes cleaning, transforming, and preparing raw data for further analysis.

We discussed it in detail in a previous blog post. Please refer to :
7 Essential Techniques for Data Preprocessing Using Python: A Guide for Data Scientists.

3. Implementing Simple Linear Regression using python :

Step 1 : Importing Necessary Libraries

The first step is to import the libraries needed for the analysis.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

Step 2 : Loading and preprocessing the data

For this model we will use a dataset of students that includes their study hours and their exam scores. The data is stored in a CSV file named student_scores.csv.


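from scipy import stats

# Loading the data in code
df = pd.read_csv("student_scores.csv")

# Checking for missing values in data
missing_values = df.isnull().sum()
print(missing_values)

# Handling missing values (here they are simply replaced with 0)
df = df.fillna(0)

# Handling Outliers
sns.boxplot(x=df['Hours'])
plt.show()

# Keep only rows whose 'Hours' value is within 3 standard deviations of the mean
df = df[np.abs(stats.zscore(df['Hours'])) < 3]

# Encoding the categorical Variables
# The student data described above has only study hours and exam scores
# (both numeric), so no encoding is needed here. For a dataset that does have
# a categorical column (for example one named 'Category'), you could use:
# df = pd.get_dummies(df, columns=['Category'])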
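The outlier step above keeps only rows whose z-score is below 3 in absolute value. The following is a minimal sketch of that rule on made-up numbers, assuming a single numeric column named `Hours`; the values in `demo` are hypothetical and only serve to show one extreme entry being dropped.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical data: 19 plausible study-hour values plus one extreme entry (60.0)
demo = pd.DataFrame({
    "Hours": [2.5, 5.1, 3.2, 8.5, 3.5, 1.5, 9.2, 5.5, 8.3, 2.7,
              7.7, 5.9, 4.5, 3.3, 1.1, 8.9, 2.5, 1.9, 6.1, 60.0]
})

# z-score: how many standard deviations each value lies from the column mean
z = np.abs(stats.zscore(demo["Hours"].to_numpy()))

# Boolean mask keeps the rows that are not extreme
filtered = demo[z < 3]
print(len(demo), "rows before,", len(filtered), "rows after")
```

Note that on very small samples a single outlier inflates the standard deviation, so the z-score of even an extreme value may stay below 3. The rule works better as the number of rows grows.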

Step 3 : Creating dependent and independent variables

x = df.iloc[:, :-1].values  # independent variable
y = df.iloc[:, -1].values   # dependent variable

Step 4 : Splitting dataset in training and test set

Splitting the dataset helps in evaluating the model: we train it on the training set and then use the test set, which the model has not seen, to check how well it was trained.

from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x, y, test_size=0.2, random_state=21)
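To make the effect of `test_size=0.2` concrete, here is a small sketch on synthetic data (25 made-up samples, not the student dataset). It uses the same `train_test_split` call and prints the shapes of the resulting pieces.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 25 samples, one feature column
x_demo = np.arange(25, dtype=float).reshape(-1, 1)
y_demo = 2.0 * x_demo.ravel() + 1.0

# 20% of the rows go to the test set; random_state makes the shuffle repeatable
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=21
)

print(x_tr.shape, x_te.shape)  # (20, 1) (5, 1)
print(y_tr.shape, y_te.shape)  # (20,) (5,)
```

Fixing `random_state` means the same rows land in the test set on every run, which keeps the evaluation numbers reproducible.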

Step 5 : Fitting the Regression model on data

The sklearn library provides a class named LinearRegression, which is used to fit the model to the data.


from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train, y_train)
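After fitting, the learned intercept (b0) and slope (b1) are available as the `intercept_` and `coef_` attributes of the model. The sketch below shows this on made-up data that follows score = 10 + 10 * hours exactly; the names `hours_demo`, `scores_demo`, and `demo_model` are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data lying exactly on the line score = 10 + 10 * hours
hours_demo = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # 2D: (samples, features)
scores_demo = np.array([20.0, 30.0, 40.0, 50.0, 60.0])      # 1D: one target per sample

demo_model = LinearRegression()
demo_model.fit(hours_demo, scores_demo)

b0 = demo_model.intercept_   # estimated intercept
b1 = demo_model.coef_[0]     # estimated slope for the single feature
print("intercept:", b0, "slope:", b1)

# predict() expects a 2D input, hence the double brackets
pred = demo_model.predict([[6.0]])[0]
print("prediction for 6 hours:", pred)
```

Since the target is one-dimensional, `predict` returns a one-dimensional array, so a single prediction is read with one index.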

Step 6 : Predicting

After fitting the model, we can use it to make predictions on new data. Let's test the model by predicting the exam score for a student who studies for 9.25 hours.

hrs = 9.25
score = regressor.predict([[hrs]])  # predict expects a 2D input
print('Predicted Score:', score[0])

Output :

Predicted Score: 92.90985477015731

Step 7 : Model Evaluation

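For model evaluation, two values are important: the Mean Squared Error (MSE), which is the average of the squared differences between the actual and predicted values, and R-squared, which is the proportion of the variance in the dependent variable that the model explains.

Below is the code to calculate these values :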


from sklearn.metrics import mean_squared_error, r2_score

# making predictions on the test set
y_pred = regressor.predict(x_test)

# calculating mean squared error
mse = mean_squared_error(y_test,y_pred) 
print('mean_squared_error : ',mse)


# calculating r-squared value
r2 = r2_score(y_test,y_pred)
print("R-Squared value : ",r2)
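To show what these two metrics compute, the following sketch calculates them by hand with NumPy on a few made-up values and compares the results with the sklearn functions. The arrays `y_true` and `y_hat` are hypothetical and unrelated to the student dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 8.0])

# MSE: average of the squared prediction errors
mse_manual = np.mean((y_true - y_hat) ** 2)

# R-squared: 1 - (unexplained variation / total variation around the mean)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_hat))
print(r2_manual, r2_score(y_true, y_hat))
```

A lower MSE means smaller errors, while an R-squared closer to 1 means the model explains more of the variation in the scores.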

Conclusion :

In this article we studied the concept of simple linear regression and implemented it in Python.

Remember, linear regression is just one of many techniques available in regression analysis. Keep exploring and expanding your knowledge to unlock the full potential of regression analysis.

DEV Community