What is Regression?
Regression is a supervised learning method used to determine the relationship between variables. When the output variable is a real or continuous value, you have a regression problem.
In this lesson, I'll show you how to build a predictive model using linear regression, with the fish market dataset as an example. Let's get started.
What is Linear Regression?
Linear regression is a straightforward statistical method for modeling the relationship between continuous variables. As the name implies, it assumes a linear relationship between the independent variable (X-axis) and the dependent variable (Y-axis). A linear regression with only one input variable (x) is called simple linear regression. When there are several input variables, it is called multiple linear regression. The fitted model is a sloped straight line that describes the relationship between the variables.
The dependent variable and independent variable have a linear relationship, as shown in the graph above. In this graph, when the value of x (the independent variable) rises, so does the value of y (the dependent variable). The red line is the best-fit straight line. Our aim is to find the line that best fits the given data points.
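To make the idea of a best-fit line concrete, here is a minimal sketch that estimates the slope and intercept by least squares. The data are synthetic (made up for illustration, not taken from the fish market dataset), and the names x, y, slope, and intercept are assumptions of this example.

```python
import numpy as np

# Synthetic data, made up for illustration: y = 2x + 1 plus a little noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

# Closed-form least-squares estimates for simple linear regression
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

# np.polyfit with degree 1 solves the same least-squares problem
slope_np, intercept_np = np.polyfit(x, y, 1)

print("slope:", slope, "intercept:", intercept)
print("polyfit:", slope_np, intercept_np)
```

Both approaches should agree up to floating-point error, and the estimates should land close to the slope and intercept used to generate the data.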
Firstly, let's import basic utilities:
In [1]:
%matplotlib inline
import numpy as np
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
Let's now read the csv file (keep in mind it should be in the same folder as the notebook!):
In [2]:
df = pd.read_csv('Fish.csv')
Let's take a peek at the dataset's first few rows to get a basic understanding of its structure.
In [3]:
df.head()
Now let's look at the dataset more closely to obtain essential statistical indicators such as the mean and standard deviation.
In [4]:
df.describe()
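scikit-learn expects the features X as a two-dimensional array of shape (n_samples, n_features), so the single column must be reshaped; passing a one-dimensional array to fit() raises an error. Here y is reshaped the same way for consistency.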
In [5]:
X = np.array(df['Length1']).reshape(-1, 1)
y = np.array(df['Length2']).reshape(-1, 1)
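As a small illustration of what reshape(-1, 1) does, the sketch below turns a one-dimensional array into a single-column two-dimensional array. The values in lengths are hypothetical and used only for this example.

```python
import numpy as np

# Hypothetical length values, used only for illustration
lengths = np.array([23.2, 24.0, 23.9, 26.3])
print(lengths.shape)    # one-dimensional: 4 values

# -1 lets NumPy infer the number of rows; 1 fixes a single column
as_column = lengths.reshape(-1, 1)
print(as_column.shape)  # two-dimensional: 4 rows, 1 column
```

With pandas, selecting a column with double brackets, as in df[['Length1']], also returns a two-dimensional object that scikit-learn accepts.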
Split the dataset into two parts: train and test. This lets us evaluate the model on data it has not seen during training. We will use the train part to fit the model and the test part to compute the evaluation metrics.
In [6]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state=101)
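The sketch below shows, on toy data made up for illustration, what train_test_split returns and why random_state is set: the same seed reproduces the same split. The names X_toy, y_toy, and the suffixed variables are assumptions of this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, made up for illustration: 100 samples with one feature
X_toy = np.arange(100).reshape(-1, 1)
y_toy = 2 * X_toy[:, 0]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=101
)

# Calling it again with the same random_state gives the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=101
)

print(len(X_tr), len(X_te))         # sizes of the two parts
print(np.array_equal(X_te, X_te2))  # whether the two splits match
```

Rows of X and y are shuffled together, so each target value stays paired with its own feature row.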
Import the model and instantiate it:
In [7]:
from sklearn.linear_model import LinearRegression
linearmodel = LinearRegression()
Now let's train the model:
In [8]:
linearmodel.fit(X_train, y_train)
Let's see how the model performs using R2, the coefficient of determination. R2 measures the proportion of the variance in the dependent variable that the model explains, so it indicates how well the model fits. With a single independent variable, the R2 of the fitted line on the training data equals the square of the Pearson correlation coefficient. A value of 1.0 indicates a perfect fit, and 0.0 means the model does no better than always predicting the mean. On test data, as computed by score() here, R2 can even be negative when the model does worse than that.
In [9]:
linearmodel.score(X_test, y_test)
It's quite high! This is because the two variables (Length1 and Length2), as seen during the EDA, are almost perfectly linearly related. Let's compare the predicted values to the test values in the dataset.
In [10]:
plt.scatter(X_test, y_test)
plt.plot(X_test, linearmodel.predict(X_test), color = 'red')
plt.show()
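Returning to the R2 score used above, here is a minimal sketch of how it can be computed by hand, assuming synthetic data made up for illustration. It compares the manual value with score() and with the squared Pearson correlation on the same data the model was fitted on.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data, made up for illustration
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 10, size=(80, 1))
y_demo = 3.0 * X_demo[:, 0] + 2.0 + rng.normal(scale=1.0, size=80)

model = LinearRegression().fit(X_demo, y_demo)
pred = model.predict(X_demo)

# R2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_demo - pred) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# Pearson correlation between the single feature and the target
pearson_r = np.corrcoef(X_demo[:, 0], y_demo)[0, 1]

print(r2_manual, model.score(X_demo, y_demo), pearson_r ** 2)
```

The three printed values should agree up to floating-point error. This equality holds for a one-variable model evaluated on its training data; on a separate test set the values can differ.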
Linear Regression (multiple independent variables): Let's predict weight
Predicting the weight of the fish using linear regression is similar to the previous example. The only significant difference is that there are now several independent variables. The categorical variable "Species" will be dropped entirely.
In [11]:
x = df.drop(['Weight', 'Species'], axis = 1)
y = df['Weight']
In [12]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 42)
In [13]:
linearmodel = LinearRegression()
linearmodel.fit(x_train, y_train)
linearmodel.score(x_test, y_test)
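With several independent variables, the fitted model holds one coefficient per feature. The sketch below illustrates this on a toy DataFrame made up for this example; the "Height" column and all values are hypothetical, not taken from Fish.csv.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data, made up for illustration ("Height" is a hypothetical column)
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "Length1": rng.uniform(10, 40, size=60),
    "Height": rng.uniform(2, 15, size=60),
})
toy["Weight"] = (
    20.0 * toy["Length1"] + 30.0 * toy["Height"]
    + rng.normal(scale=5.0, size=60)
)

features = toy.drop("Weight", axis=1)
model = LinearRegression().fit(features, toy["Weight"])

# One coefficient per input column, in the same order as the columns
for name, coef in zip(features.columns, model.coef_):
    print(f"{name}: {coef:.2f}")
print("intercept:", model.intercept_)
```

Each coefficient estimates how much the prediction changes when that feature increases by one unit while the others are held fixed.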
Conclusion
The model fits reasonably well, but not as well as hoped for this problem and data. The EDA didn't cover some basics such as feature selection and outlier removal. The model was also deprived of the 'Species' feature, which, as you might imagine, may influence the weight of a fish. Another important factor is the size of the dataset: larger datasets usually lead to more accurate results. In any case, the goal of a simple linear regression is to predict the value of a dependent variable based on an independent variable. The stronger the linear relationship between the two, the more accurate the prediction.
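One possible way to keep 'Species' instead of dropping it is one-hot encoding. The sketch below uses pandas get_dummies on a tiny DataFrame whose rows are made up for illustration; it is not the approach used in this lesson, only a hint at a next step.

```python
import pandas as pd

# Tiny example frame; all values are made up for illustration
toy = pd.DataFrame({
    "Species": ["Bream", "Perch", "Bream", "Pike"],
    "Length1": [23.2, 8.4, 24.0, 30.0],
    "Weight": [242.0, 5.9, 290.0, 200.0],
})

# Replace the text column with one indicator column per species
encoded = pd.get_dummies(toy, columns=["Species"])
print(encoded.columns.tolist())
```

The encoded frame contains only numeric or boolean columns, so after separating out 'Weight' it can be passed to LinearRegression like the other features.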
Thanks for reading
P.S. I'm looking forward to being your friend. Let's connect on Twitter.