likhitha manikonda

Linear Regression Algorithm

Linear regression is a powerful and widely used algorithm in both statistics and machine learning. It helps model relationships between variables and make predictions.

Linear Regression – Real-Life Examples

Here are some compelling real-world applications:
House Price Prediction: Predicting the price of a house based on size, location, number of rooms, etc.
Sales Forecasting: Estimating future sales based on past sales data and advertising spend.
Student Performance: Predicting exam scores based on study hours, attendance, and previous grades.
Weather Prediction: Forecasting temperature based on historical weather data.
Medical Cost Estimation: Estimating hospital bills based on patient age, condition, and treatment type.
Stock Market Trends: Predicting future stock prices using past price data.
Traffic Prediction: Estimating traffic volume based on time of day and weather conditions.
Crop Yield Forecasting: Predicting how much crop will grow based on rainfall, soil quality, and fertilizer use.

The main aim of linear regression is to find the best-fit line: the line for which the total difference between the actual data points and the points predicted by the line is as small as possible.

The line is linear: y is a linear function of x (y = mx + c).

Train Dataset ⟶ Model ⟶ Hypothesis

A new age (input) fed into the hypothesis gives the predicted weight (output).

Equation of straight line:

y=mx+c (Hypothesis)

c = intercept (when x = 0, the point at which the regression line crosses the y-axis)
m = slope or coefficient (for a unit movement along the x-axis, the corresponding movement along the y-axis is defined by the slope)
(23:00 timestamp in Krish Naik's video: https://www.youtube.com/watch?v=JxgmHe2NyeY&t=2475s)
x is the input data point
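
As a minimal sketch (my own illustration, not from the post), the hypothesis is just this straight-line function in code:

```python
# Minimal sketch of the hypothesis h(x) = m*x + c.
def hypothesis(x, m, c):
    """Predict y for input x given slope m and intercept c."""
    return m * x + c

# Example: slope 2, intercept 1 -> for x = 3, prediction is 7.
print(hypothesis(3, m=2, c=1))  # 7
```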

How to find the best-fit line: start with some line and iteratively improve it until it becomes the best fit.
To find the best-fit line we need to keep updating m and c, and to drive those updates we need a cost function.

Cost function / Squared Error Function:

J(θ0, θ1) = (1/2m) · Σᵢ₌₁ᵐ (h(xᵢ) − yᵢ)²

The 1/2 factor is there purely to make the derivative cleaner later.
1/m gives the average over the data points.
i = 1 to m sums over all the data points.
h(xᵢ) − yᵢ is the difference between the predicted value and the actual value.

What we need to solve: minimize the cost function J(θ0, θ1) by adjusting θ0 (the intercept c) and θ1 (the slope m).

Let's take an example dataset: (1, 1), (2, 2), (3, 3).
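
To make the cost function concrete, here is a small sketch (my own code, not from the post) that evaluates J on this dataset with θ0 fixed at 0. The cost bottoms out at θ1 = 1, the line y = x that passes through all three points:

```python
# Squared-error cost J(theta0, theta1) = (1/2m) * sum((h(x_i) - y_i)^2)
X = [1, 2, 3]
Y = [1, 2, 3]

def cost(theta0, theta1):
    m = len(X)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(X, Y)) / (2 * m)

# With theta0 = 0, the cost is smallest at theta1 = 1 (the line y = x).
for theta1 in [0.0, 0.5, 1.0, 1.5, 2.0]:
    print(f"theta1 = {theta1}: J = {cost(0, theta1):.3f}")
```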

But so far we are just guessing values of θ0 and θ1 and computing the cost at each guess. What we really want is a procedure that starts at some point and moves step by step toward the global minimum. For that we use the convergence algorithm.

CONVERGENCE ALGORITHM: Repeat until convergence:

θ1 := θ1 − α · (d/dθ1) J(θ1)

Positive slope: the curve is rising as we move to the right (derivative > 0).
Negative slope: the curve is falling as we move to the right (derivative < 0).

α (learning rate): the step size by which we move toward the global minimum.

It should always be a small number, but not too small and not too big. If it is too small, it takes a very long time (tiny steps) to reach the global minimum. If it is too large, the updates jump back and forth and may never settle at the global minimum.

When the slope is positive, each iteration subtracts a positive quantity, so θ1 moves down toward the global minimum.

When the slope is negative, subtracting a negative quantity adds, so θ1 moves up toward the global minimum.
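
Here is a one-step sketch (my own illustration, with θ0 fixed at 0 on the same toy dataset) showing that subtracting α × slope moves θ1 toward the minimum from either side:

```python
# One gradient descent step: theta1 := theta1 - alpha * slope.
def gradient(theta1, X, Y):
    """Derivative of J with respect to theta1 (theta0 fixed at 0)."""
    m = len(X)
    return sum((theta1 * x - y) * x for x, y in zip(X, Y)) / m

X, Y = [1, 2, 3], [1, 2, 3]
alpha = 0.1

# Start right of the minimum (positive slope): theta1 decreases.
t = 2.0
print(t - alpha * gradient(t, X, Y))   # ~1.53, moving down toward 1.0

# Start left of the minimum (negative slope): theta1 increases.
t = 0.0
print(t - alpha * gradient(t, X, Y))   # ~0.47, moving up toward 1.0
```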
Another scenario: what if the cost function has local minima?

At a local minimum the slope is 0, so θ1 := θ1 − α(0) = θ1, i.e. θ1 stops changing and we would be stuck there. But with the gradient descent and the cost function we are using here, we don't get stuck in a local minimum, because the cost surface always looks like this:

[Figure: convex, bowl-shaped cost function]

But in deep learning, gradient descent does run into local minima; there, techniques like momentum, the Adam optimizer, learning rate schedules, and batch normalization help navigate these challenges.

Deep learning: many local minima. Deep learning models, especially deep neural networks, have non-convex cost functions due to their complex architectures and multiple layers.
In linear regression, by contrast, the cost function (usually Mean Squared Error) is convex.

A convex function has only one global minimum and no local minima. This means gradient descent will always converge to the best solution, regardless of the starting point. Because of this convexity, optimization is straightforward and reliable.
When does convergence stop? When J(θ1) becomes very small and further updates barely change it.

GRADIENT DESCENT ALGORITHM:
Repeat until convergence:

θj := θj − α · (∂/∂θj) J(θ0, θ1)   (update θ0 and θ1 simultaneously)
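
Putting the pieces together, here is a minimal runnable sketch (my own code, using the toy dataset above and a tolerance-based stopping rule as an assumed convergence check):

```python
# Batch gradient descent for y = theta0 + theta1 * x.
X, Y = [1, 2, 3], [1, 2, 3]
m = len(X)
alpha = 0.1          # learning rate
tol = 1e-9           # stop when the cost barely changes
theta0, theta1 = 0.0, 0.0

def cost(t0, t1):
    return sum((t0 + t1 * x - y) ** 2 for x, y in zip(X, Y)) / (2 * m)

prev = cost(theta0, theta1)
for step in range(10000):
    # Partial derivatives of J with respect to theta0 and theta1.
    errors = [theta0 + theta1 * x - y for x, y in zip(X, Y)]
    grad0 = sum(errors) / m
    grad1 = sum(e * x for e, x in zip(errors, X)) / m
    # Simultaneous update of both parameters.
    theta0 -= alpha * grad0
    theta1 -= alpha * grad1
    cur = cost(theta0, theta1)
    if abs(prev - cur) < tol:   # converged
        break
    prev = cur

print(f"theta0 = {theta0:.3f}, theta1 = {theta1:.3f}")  # approaches 0 and 1, i.e. y = x
```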

With two parameters the cost function is already a 3D bowl; if we have multiple features, it becomes a surface in even higher dimensions.

PERFORMANCE METRICS: we use these to verify the model and measure how good our linear regression model is.

R-squared and Adjusted R-squared:
R-squared and adjusted R-squared both measure a regression model's goodness of fit, but adjusted R-squared is a modified version that accounts for the number of predictors in the model.

  • While R-squared always increases when you add more variables, adjusted R-squared can decrease if an added variable doesn't sufficiently improve the model.
  • Therefore, adjusted R-squared is considered a more reliable measure for comparing models with different numbers of independent variables.

R² = 1 − Σ(yᵢ − ŷᵢ)² / Σ(yᵢ − ȳ)²

Here ŷᵢ are the predicted data points (the green dots in the usual plot) and ȳ (y bar) is the mean of the actual values.

Example model: predicting Price from the features Gender, Bedrooms, and Location.

The R-squared value keeps increasing even if we add one more feature that is not related at all. In the example above, even though the Gender column has no relationship to the target (the price is independent of gender), R-squared still increases, whereas Adjusted R-squared does not.

Adjusted R-squared:

Adjusted R² = 1 − (1 − R²)(n − 1) / (n − p − 1), where n is the number of samples and p is the number of predictors.

Between R-squared and Adjusted R-squared, R-squared is always the larger of the two, and the gap grows as the number of predictors/features increases.
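
A short sketch (my own code; the y values and predictor count are made up for illustration) that computes both metrics directly from their definitions:

```python
# R^2 and adjusted R^2 from their definitions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.8, 5.1, 7.2, 8.9]   # hypothetical model predictions
p = 2                            # assumed number of predictors in the model

n = len(y_true)
y_mean = sum(y_true) / n
ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
ss_tot = sum((yt - y_mean) ** 2 for yt in y_true)

r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(f"R^2 = {r2:.4f}, adjusted R^2 = {adj_r2:.4f}")  # adjusted is the smaller
```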


🍭 If this post was a treat, the next one’s a whole candy shop. Go ahead, indulge your curiosity! 🍬📚
https://dev.to/codeneuron/how-to-check-if-linear-regression-works-for-your-dataset-1g91
