Sarima Chiorlu
Regression in machine learning

One of the most popular uses of machine learning models, particularly in supervised machine learning, is solving regression problems. Regression algorithms are trained to learn the relationship between an outcome (the dependent variable) and one or more independent variables. In layman's terms, this means fitting a function from a specified family of functions to the sampled data under some error function. Its key uses are prediction, forecasting, time series modeling, and establishing the causal connection between variables.

This fitting of a function serves two purposes:

  • Estimating missing data within your data range
  • Estimating future data outside your data range

The most common application, though, is predicting future data outside your data range once the model has been trained. A machine learning regression algorithm is similar to the line of best fit from linear algebra. Let's take a trip down memory lane to elementary mathematics: we were given X and y points, asked to plot a linear graph, and then asked in exercises to find the value of y when x is 6. I believe we all went ahead, plotted the graph, and read off the corresponding value of y. This is very similar to our regression problem, except now we want the machine to do it for us. Here, X represents the variables we use to predict y, while y is what we want to find out. Let's keep this basic understanding in mind as we move forward.
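As a rough illustration of that school exercise, here is a minimal sketch using NumPy with a handful of made-up (x, y) points; the numbers are purely hypothetical:

import numpy as np

# Hypothetical (x, y) points, like the ones from the school exercise
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])

# Fit a straight line (degree-1 polynomial) through the points
slope, intercept = np.polyfit(x, y, 1)

# Estimate y for x = 6, just as we did by reading it off the graph
print(slope * 6 + intercept)  # roughly 12.0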

Terminologies used in Regression

  • Dependent Variable (Y): The dependent variable is the main factor in regression analysis that we wish to predict or understand. It is also known as the target value or label.
  • Independent Variable (X): The independent variables are the elements that influence the dependent variable or are used to predict its values. They are usually referred to as our features.
  • Outliers: An outlier is an observation with either a very low or a very high value in comparison to the other observed values. Outliers should be handled carefully, as they can hurt the outcome.
  • Multicollinearity: If the independent variables are highly correlated with each other, the condition is called multicollinearity. It should not be present in the dataset, because it creates problems when ranking the most influential variables.
  • Underfitting and Overfitting: If our algorithm works well with the training dataset but not with the test dataset, the problem is called overfitting. If our algorithm does not perform well even with the training dataset, the problem is called underfitting.

Uses of regression models

Common uses for machine learning regression models include:

  • Forecasting continuous outcomes like house prices, stock prices, or sales.
  • Predicting the success of future retail sales or marketing campaigns to ensure resources are used effectively.
  • Predicting customer or user trends, such as on streaming services or e-commerce websites.
  • Analyzing datasets to establish the relationships between variables and output.
  • Predicting interest rates or stock values based on a multitude of factors.
  • Creating time series visualizations.

Types of regression analysis

Let's now discuss the various methods through which we can perform regression.

Regression can be carried out in machine learning using a variety of well-known techniques. The different methods may use different numbers of independent variables or handle different kinds of data, and they may also assume different relationships between the independent and dependent variables. Linear regression techniques, for example, assume that the relationship is linear and would be ineffective on nonlinear datasets.

Types of regression models

  • Simple Linear Regression
  • Multiple linear regression
  • Logistic regression
  • Support Vector Regression
  • Decision Tree Regression
  • Random Forest Regression

Simple linear regression

Simple Linear Regression is an approach that fits a straight line through the data points to minimize the error between the line and the points. In this scenario, the connection between the independent and dependent variables is assumed to be linear. The method is straightforward because it investigates the relationship between the dependent variable and a single independent variable. Because the fit is a single straight line of best fit, outliers can noticeably pull the line away from most of the data points.

  • Linear regression is a statistical regression technique used for predictive analysis.
  • It is one of the most basic and straightforward algorithms, and it uses regression to illustrate the relationship between continuous variables.
  • In machine learning, it is used to solve regression problems.
  • The term "linear regression" refers to a statistical method that models a linear relationship between the independent variable (X-axis) and the dependent variable (Y-axis).
  • If there is only one input variable (x), such linear regression is known as "simple linear regression"; if there is more than one input variable, it is known as "multiple linear regression".

Y = aX + b

where Y = what we are trying to predict

X = the features or variables we use to predict the value of Y

a = slope of the line

b = intercept on the Y-axis (similar to the line of best fit from our maths classes, right?)
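To make the Y = aX + b idea concrete, here is a minimal scikit-learn sketch; the data is invented purely for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: X is a single feature, y is the target we want to predict
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([3, 5, 7, 9, 11])

model = LinearRegression().fit(X, y)

print(model.coef_[0])        # a: slope of the line
print(model.intercept_)      # b: intercept on the Y-axis
print(model.predict([[6]]))  # predicted Y when X = 6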

Multiple linear regression

Multiple linear regression is used when more than one independent variable is involved. When numerous independent variables are included, it can achieve a better fit than simple linear regression. Polynomial regression is an example of a multivariate linear regression technique: it is a sort of multiple linear regression in which, when plotted in two dimensions, the outcome is a curved line fitted to the data points.

Logistic regression is employed when the dependent variable can take one of two values, such as true or false, or success or failure. Logistic regression models can be used to forecast the likelihood of occurrence of the dependent variable, so the output values must typically be binary. A sigmoid curve can be used to depict the relationship between the dependent and independent variables.
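Since logistic regression came up here, the sketch below shows the idea with scikit-learn; the hours-studied/pass-fail data is made up purely for illustration:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical example: predict pass (1) or fail (0) from hours studied
hours = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(hours, passed)

# Probability of passing after 4.5 hours of study (the sigmoid output)
print(clf.predict_proba([[4.5]])[0, 1])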

Polynomial Regression

Polynomial regression is a sort of regression that uses a linear model to represent a non-linear dataset. It is comparable to multiple linear regression, but it fits a non-linear curve between the values of x and the corresponding conditional values of y. Suppose there is a dataset whose samples are distributed in a non-linear fashion; in this situation, a straight-line linear regression will not fit those data points well, and polynomial regression is required to cover them. In polynomial regression, the original features are transformed into polynomial features of a specific degree and then modeled using a linear model.

Note: This differs from multiple linear regression in that, in polynomial regression, a single feature is raised to different degrees rather than having multiple variables with the same degree.
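As a minimal sketch of this, assuming a toy dataset where y roughly follows x squared, we can transform the single feature with PolynomialFeatures and fit an ordinary linear model on top:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Toy non-linear data: y roughly follows x squared
X = np.arange(1, 11).reshape(-1, 1)
y = X.ravel() ** 2 + np.random.normal(0, 2, size=10)

# Transform the single feature into polynomial features of degree 2,
# then fit a linear model on the transformed features
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X, y)

print(poly_model.predict([[12]]))  # extrapolated estimate, should land near 144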

Support Vector Regression

Support Vector Machine is a supervised learning algorithm that can be used for regression as well as classification problems. So if we use it for regression problems, then it is termed Support Vector Regression.

Support Vector Regression is a regression algorithm that works for continuous variables. Below are some keywords which are used in Support Vector Regression:

  • Kernel: A function that converts lower-dimensional data to higher-dimensional data.
  • Hyperplane: In SVM for classification, this is the line that divides the two classes; in SVR, it is the line that helps forecast the continuous variable and covers the majority of the data points.
  • Boundary lines: The two lines drawn at a distance from the hyperplane that create a margin around the data points.
  • Support vectors: The data points closest to the hyperplane and to the opposing class.

In SVR, we always try to determine a hyperplane with a maximum margin, so that the maximum number of data points fall within that margin. The basic purpose of SVR is to keep as many data points as possible within the boundary lines, with the hyperplane (best-fit line) covering as many of them as possible.
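Here is a minimal sketch of Support Vector Regression with scikit-learn's SVR; the one-dimensional sine data and the C and epsilon values are arbitrary choices for illustration:

import numpy as np
from sklearn.svm import SVR

# Made-up one-dimensional training data
X = np.linspace(0, 10, 50).reshape(-1, 1)
y = np.sin(X).ravel()

# The RBF kernel maps the data to a higher-dimensional space;
# epsilon controls the width of the margin (boundary lines) around the hyperplane
svr = SVR(kernel="rbf", C=100, epsilon=0.1)
svr.fit(X, y)

print(svr.predict([[5.0]]))  # should be close to sin(5), about -0.96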

Decision Tree Regression

  • Decision Trees are a supervised learning method for solving classification and regression issues.
  • It is capable of resolving problems with both categorical and numerical data.
  • Decision Tree regression constructs a tree-like structure in which each internal node represents a "test" for an attribute, each branch indicates the test's result, and each leaf node provides the ultimate decision or result.
  • Starting with the root node/parent node (the full dataset), a decision tree is built by dividing it into left and right child nodes (subsets of the dataset). These child nodes are divided further into their own children, and in turn become the parent nodes of those nodes. A minimal sketch of decision tree regression follows below.
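A minimal, hypothetical sketch of decision tree regression with scikit-learn; the data and the max_depth value are made up for illustration:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up data: a single feature and a continuous target
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([5, 7, 9, 15, 17, 19, 30, 32])

# max_depth limits how many times the tree may split (helps against overfitting)
tree = DecisionTreeRegressor(max_depth=2, random_state=0)
tree.fit(X, y)

print(tree.predict([[4.5]]))  # prediction comes from the leaf node that 4.5 falls into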
Random Forest Regression

  • Random forest is a powerful supervised learning algorithm capable of handling both regression and classification problems.
  • Random Forest regression is an ensemble learning method that combines multiple decision trees and predicts the final output as the average of the individual tree outputs. The combined decision trees are called base models, and they can be represented more formally as:

g(x) = f0(x) + f1(x) + f2(x) + ...

  • Random forest uses the Bagging (Bootstrap Aggregation) technique of ensemble learning, in which the aggregated decision trees run in parallel and do not interact with each other.
  • With the help of Random Forest regression, we can help prevent overfitting in the model by creating random subsets of the dataset for each tree.

In scikit-learn, the random forest regressor is implemented as a meta estimator that fits several decision trees on different sub-samples of the dataset and uses averaging to increase predictive accuracy and control over-fitting. Some of the important parameters are highlighted below:

  • n_estimators — the number of decision trees you will be running in the model
  • criterion — This variable lets you choose the criterion (loss function) that will be used to decide model outcomes. We can choose between loss functions like mean squared error (MSE) and mean absolute error (MAE). MSE is the default.
  • max_depth — this sets the maximum possible depth of each tree
  • max_features — the maximum number of features the model will consider when determining a split
  • bootstrap — the default value for this is True, meaning the model follows bootstrapping principles (defined earlier)
  • max_samples — This parameter is only effective if bootstrapping is set to True; otherwise, it has no effect. When True, this variable specifies the largest size of each sample for each tree
  • Other important parameters are min_samples_split, min_samples_leaf, n_jobs, and others that can be read about in scikit-learn's RandomForestRegressor documentation. A short usage sketch follows this list.
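As a rough sketch of how some of these parameters fit together, here is a RandomForestRegressor on a synthetic dataset; the parameter values are arbitrary examples rather than recommendations:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data, used only for illustration
X, y = make_regression(n_samples=500, n_features=6, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# A few of the parameters discussed above (values are arbitrary examples)
rf = RandomForestRegressor(
    n_estimators=100,     # number of decision trees in the forest
    max_depth=10,         # maximum depth of each tree
    max_features="sqrt",  # number of features considered at each split
    bootstrap=True,       # sample the training data with replacement
    random_state=0,
)
rf.fit(X_train, y_train)

print(rf.score(X_test, y_test))  # R^2 score on the held-out test set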

We will be focusing on the linear regression model in this article. Subsequent articles illustrating each of the other models will be released at a later date.

A practical illustration of linear regression is shown in the code below. The code depicts a multiple linear regression, but the same code can be run for a simple linear regression model. As a bonus, we first perform a quick analysis of the data before training.

#importing our libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

#importing our dataset
dataset = pd.read_csv("/content/Real estate.csv")

#brief overview of what our data looks like
dataset.head()

#Getting descriptive information from our data
dataset.describe()
dataset.info()

#Carrying out our data analysis to see correlations between our data
import seaborn as sns

sns.jointplot(dataset["X1 transaction date"], dataset["Y house price of unit area"])
sns.jointplot(dataset["X2 house age"], dataset["Y house price of unit area"])
sns.jointplot(dataset["X3 distance to the nearest MRT station"], dataset["Y house price of unit area"])
sns.jointplot(dataset["X4 number of convenience stores"], dataset["Y house price of unit area"])
sns.jointplot(dataset["X5 latitude"], dataset["Y house price of unit area"])
sns.jointplot(dataset["X6 longitude"], dataset["Y house price of unit area"])
sns.pairplot(dataset)

sns.lmplot(x='X5 latitude',y ='Y house price of unit area', data=dataset)
sns.lmplot(x='X6 longitude',y ='Y house price of unit area', data=dataset)

#Splitting our data into a training set and testing set
y = dataset["Y house price of unit area"]
X = dataset[["X1 transaction date" ,"X2 house age", "X3 distance to the nearest MRT station", "X4 number of convenience stores","X5 latitude", "X6 longitude"]]

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

#Training the simple linear model on the training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

#Predicting the test results
y_pred = regressor.predict(X_test)

#Visualising the test set results
plt.scatter(y_test,y_pred)
plt.xlabel('Actual house price of unit area')
plt.ylabel('Predicted house price of unit area')

#Calculating the mean absolute error, mean squared error and the root mean squared error
from sklearn import metrics

#Evaluating our model
print('MAE:', metrics.mean_absolute_error(y_test, y_pred))
print('MSE:', metrics.mean_squared_error(y_test, y_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

#We move ahead to explore the residuals to ensure everything is alright with our model
sns.histplot(y_test - y_pred, bins=50, kde=True)

You can check out the full code here

Conclusion

In summary, regression models help us predict a continuous outcome, such as the price of a house, based on some predetermined independent variables. The machine learns the pattern and can then estimate values both within and beyond the range of the observed data.
