DEV Community

Anurag Verma

Linear Regression in Python: From Data to Model

What is Linear Regression?

Linear regression is a statistical method for modeling the relationship between a dependent variable (also called the outcome or response variable) and one or more independent variables (also called predictors or explanatory variables). The goal is to find the best-fitting line through a set of data points, where "best" usually means the line that minimizes the sum of squared differences between the observed and predicted values (ordinary least squares). With one independent variable, the line has the form y = mx + b, where y is the dependent variable, x is the independent variable, m is the slope, and b is the y-intercept. A model with one independent variable is called simple linear regression; a model with more than one is called multiple linear regression, and its equation gains one coefficient per additional variable.
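
To make the idea of a best-fitting line concrete, here is a minimal sketch of the closed-form least-squares solution for a single predictor. The toy data is made up for illustration (it is not the article's dataset) and lies exactly on y = 2x + 1, so the recovered slope and intercept are easy to check.

```python
import numpy as np

# Hypothetical toy data lying exactly on y = 2x + 1 (not the article's dataset).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0

# Closed-form least-squares estimates for one predictor:
#   m = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
#   b = mean_y - m * mean_x
x_mean, y_mean = x.mean(), y.mean()
m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
b = y_mean - m * x_mean

print(f"slope m = {m:.3f}, intercept b = {b:.3f}")
```

Libraries such as NumPy and scikit-learn solve the same minimization problem; the formulas above are only the special case of one independent variable.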

Linear Regression.jpeg

Importing Libraries

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
# Jupyter/IPython magic: render plots inline (omit when running as a plain script)
%matplotlib inline


Loading the train and test datasets into pandas DataFrames

train_df = pd.read_csv("/kaggle/input/random-linear-regression/train.csv")
# Drop rows with null values
train_df = train_df.dropna() 
train_df.head()

      x          y
0  24.0  21.549452
1  50.0  47.464463
2  15.0  17.218656
3  38.0  36.586398
4  87.0  87.288984

test_df = pd.read_csv("/kaggle/input/random-linear-regression/test.csv")
# Drop null values
test_df = test_df.dropna()
test_df.head()

    x          y
0  77  79.775152
1  21  23.177279
2  22  25.609262
3  20  17.857388
4  36  41.849864
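
Before dropping rows, it can be useful to see how many values are actually missing. The sketch below uses a small made-up DataFrame as a stand-in for the Kaggle CSV files, so the column values are assumptions chosen only to show the calls.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing y value (stand-in for the Kaggle CSV).
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [1.1, np.nan, 3.2, 3.9]})

missing_per_column = df.isna().sum()  # number of NaN values in each column
cleaned = df.dropna()                 # drops every row that contains a NaN

print(missing_per_column)
print(f"rows before: {len(df)}, rows after: {len(cleaned)}")
```

If only a handful of rows are lost, dropping them is usually harmless; if many are, it may be worth checking why the values are missing first.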

Selecting the independent and dependent variables

Next, we select the DataFrame columns to use for the x and y axes. In this dataset, the column 'x' holds the independent variable and the column 'y' holds the dependent variable, so we select them like this:

train_x = train_df['x']
train_y = train_df['y']

test_x = test_df['x']
test_y = test_df['y']

Visualizing the training data

To plot the data, we use Matplotlib, the popular Python visualization library that we imported above.

The plt.scatter() function plots the data points, and the plt.plot() function draws the line of best fit.

To get that line, we call numpy.polyfit() with degree 1, which fits a straight line to the data points and returns its slope and y-intercept.

# Fit a degree-1 polynomial; polyfit returns [slope, intercept]
coefficients = np.polyfit(train_x, train_y, 1)
m, b = coefficients
plt.scatter(train_x, train_y)
plt.plot(train_x, m*train_x + b)
plt.xlabel('train_x')
plt.ylabel('train_y')
plt.show()

train data
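
One detail of numpy.polyfit() that is easy to miss is the order of the returned coefficients: highest degree first, so for degree 1 the slope comes before the intercept. The following sketch, on made-up noise-free data, shows this and uses np.polyval() to evaluate the fitted line.

```python
import numpy as np

# Hypothetical noise-free data on y = 3x - 2 (not the article's dataset).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 3.0 * x - 2.0

# Degree 1 -> coefficients are ordered [slope, intercept].
coefficients = np.polyfit(x, y, 1)
m, b = coefficients

# np.polyval evaluates the fitted polynomial at the given points.
# Sorting x first keeps the plotted line from doubling back on itself.
x_line = np.sort(x)
y_line = np.polyval(coefficients, x_line)

print("slope:", m, "intercept:", b)
```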

Visualizing test data

# Fit a separate line to the test data (for visualization only)
coefficients = np.polyfit(test_x, test_y, 1)
m, b = coefficients
plt.scatter(test_x, test_y)
plt.plot(test_x, m*test_x + b)
plt.xlabel('test_x')
plt.ylabel('test_y')
plt.show()

test data

Model Creation, training, and testing

To create, train, and test a linear regression model on our data, we can use the scikit-learn library. The first step is to import the model class we want to use.

Here we use the LinearRegression class from the sklearn.linear_model module:

from sklearn.linear_model import LinearRegression

Create an instance of the model.

model = LinearRegression()

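scikit-learn expects the features as a 2-D array of shape (n_samples, n_features), so we first reshape the single x column into one column per sample. Then we use the fit() method to train the model on the training data: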

train_x = train_x.values.reshape(-1, 1)
test_x = test_x.values.reshape(-1, 1)
model.fit(train_x, train_y)
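
The reshape step is worth a closer look. The sketch below uses small made-up arrays to show the shapes before and after reshape(-1, 1), and what happens when a 1-D array is passed to fit() directly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data on y = 2x (not the article's dataset).
x = np.array([1.0, 2.0, 3.0, 4.0])  # shape (4,): a flat list of values
y = np.array([2.0, 4.0, 6.0, 8.0])

X = x.reshape(-1, 1)                # shape (4, 1): 4 samples, 1 feature
print(x.shape, X.shape)

model = LinearRegression()
try:
    model.fit(x, y)                 # 1-D features are rejected
except ValueError as err:
    print("1-D input failed:", type(err).__name__)

model.fit(X, y)                     # 2-D features are accepted
print("slope:", model.coef_[0])
```

An alternative is to select the column with double brackets, for example train_df[['x']], which returns a 2-D DataFrame and makes the reshape unnecessary.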

We can inspect the fitted slope (coef_) and intercept (intercept_) of the model with the following commands:

print("Coefficients: ",model.coef_)
print("Intercept: ",model.intercept_)

Now that the model is trained, we can use the predict() method to make predictions on the test data:

y_pred = model.predict(test_x)
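
For a single feature, predict() does nothing more than apply y = mx + b with the fitted slope and intercept. The sketch below checks this on made-up data; the names X_train, y_train, and X_new are illustrative and not part of the article's code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data and new inputs (not the article's dataset).
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([3.1, 4.9, 7.2, 8.8])
X_new = np.array([[5.0], [6.0]])

model = LinearRegression().fit(X_train, y_train)

y_pred = model.predict(X_new)

# Same prediction computed by hand: slope * x + intercept.
y_manual = model.coef_[0] * X_new[:, 0] + model.intercept_

print(np.allclose(y_pred, y_manual))
```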

Evaluating model performance

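We can evaluate the model by comparing the predicted values with the actual values. Common regression metrics include mean_absolute_error (the average absolute difference), mean_squared_error (the average squared difference, which penalizes large errors more heavily), and r2_score (the proportion of the variance in y that the model explains; 1.0 is a perfect fit).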

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

print("Mean Absolute Error: ",mean_absolute_error(test_y, y_pred))
print("Mean Squared Error: ",mean_squared_error(test_y, y_pred))
print("R2 Score: ",r2_score(test_y, y_pred))
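
To see what these metrics measure, the sketch below computes them by hand with NumPy on a few made-up values and compares the results with scikit-learn's functions. The root mean squared error (RMSE) is included as an extra because it is expressed in the same units as y.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted values (not the article's results).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 10.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))   # average absolute error
mse = np.mean(errors ** 2)      # average squared error
rmse = np.sqrt(mse)             # back in the units of y

# R^2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print("MAE: ", mae, mean_absolute_error(y_true, y_pred))
print("MSE: ", mse, mean_squared_error(y_true, y_pred))
print("RMSE:", rmse)
print("R2:  ", r2, r2_score(y_true, y_pred))
```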

Visualizing model performance

We can also visualize the results by plotting the test data points and the predicted line using the same approach as before.

plt.scatter(test_x, test_y)
plt.plot(test_x, y_pred, color='r')
plt.xlabel('x')
plt.ylabel('y')
plt.show()
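
A complementary view is a residual plot, which shows the difference between each actual and predicted value. This is a sketch on made-up numbers, with y_pred standing in for the output of model.predict(test_x); for a well-fitting linear model the residuals should scatter around zero with no visible pattern.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical actual values and predictions (not the article's results).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_true = np.array([2.2, 3.9, 6.1, 8.0, 9.8])
y_pred = 2.0 * x  # stand-in for model.predict(test_x)

residuals = y_true - y_pred

fig, ax = plt.subplots()
ax.scatter(x, residuals)
ax.axhline(0, color='r', linestyle='--')  # reference line at zero error
ax.set_xlabel('x')
ax.set_ylabel('residual (actual - predicted)')
# In a plain script, call plt.show() here to display the figure.
```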

model visualization

End! Hope you like this...

GitHub link: Complete-Data-Science-Bootcamp

Main Post: Complete-Data-Science-Bootcamp

Top comments (1)

Parth

Nice one :) I had been looking for a similar, easy-to-understand tutorial but couldn't find one. Thanks