mac172

Posted on Mar 30, 2022

Simple maths behind linear regression model

#maths #beginners #ai

This post is a little long, as I want to explain the whole linear regression model, including the error/cost function and the gradient descent algorithm.

AI is a term generating huge interest these days. But do you know what it really is? At its core, it is nothing but mathematical equations playing around with large, really large, really really large datasets. The job of mathematics is to describe the real world with logic. There is nothing mystical about it, and that's why it was created in the first place.

Back to the topic: machine learning (a subfield of AI) takes most of its concepts from mathematics, and one of them is "linear regression".

'Linear regression is a linear model that describes the output variable as a linear combination of the input variables.'

Linear Regression

Regression means predicting a continuous value, so the output we get from it is also continuous. It is a type of supervised machine learning, where each input comes with an associated output, like

{\lbrace (X_1,y_1),(X_2,y_2),...,(X_n,y_n) \rbrace}

or, if you are a fan of maths,

{\lbrace (X_i, y_i) \rbrace}_{i=1}^{n}

in a less messy way.

The main concept in linear regression comes from the equation of a line
y = mx + b
where m = slope of the line, generally determined by

m = \frac{y_2 - y_1}{x_2 - x_1}

and b = y-intercept
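To make the slope formula concrete, here is a minimal Python sketch. The helper name `line_through` and the two sample points are made up for illustration; it just applies the slope formula and then solves y = mx + b for b.

```python
def line_through(p1, p2):
    """Return slope m and y-intercept b of the line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("vertical line: slope is undefined")
    m = (y2 - y1) / (x2 - x1)   # slope from the two points
    b = y1 - m * x1             # rearranged from y1 = m * x1 + b
    return m, b


# Hypothetical points (1, 3) and (3, 7)
m, b = line_through((1.0, 3.0), (3.0, 7.0))
print(m, b)  # 2.0 1.0
```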

That's the core idea of linear regression. A lot of scary-looking mathematical notation is used to write equations clearly and compactly, but what each piece does is easy to understand, so don't fear.

First, let's start with the linear regression equation. Don't worry, I will explain each part of it in a way that is easy to understand.

Equation

h_\theta(x) = \sum_{i=0}^n {\theta_i x_i}

Now let's break it down a little bit. First,

h_\theta(x) = output

and

x_0 = 1

These two things are clear to us now. The remaining ones are

x = features/inputs, where x_i is the i^{th} feature

\theta = parameters

If you know a little bit about neural networks, you have heard the term "weights". Weights are nothing but a vector of numbers that help the model map an input value to approximately its output value.

For example, take 2y = 6. What value of 'y' balances this equation? Of course 3. That's how weights work: they are recalculated again and again until the best-matching value is obtained.

and lastly

n = number of features (the sum starts at i = 0 so that it includes x_0 = 1)
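The equation above can be sketched in a few lines of Python. This is only an illustration: the function name `predict` and the sample numbers are assumptions, and `theta` is expected to hold one parameter per feature plus one extra for x_0 = 1.

```python
def predict(theta, features):
    """h_theta(x): sum of theta_i * x_i, with x_0 = 1 prepended."""
    x = [1.0] + list(features)
    if len(theta) != len(x):
        raise ValueError("theta needs one entry per feature plus one for x_0")
    return sum(t * xi for t, xi in zip(theta, x))


# Hypothetical parameters and a sample with two features
theta = [1.0, 2.0, 0.5]
print(predict(theta, [3.0, 4.0]))  # 1*1 + 2*3 + 0.5*4 = 9.0
```

With a single feature this reduces to the line equation: theta_0 plays the role of b and theta_1 the role of m.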

Now we get an output from our linear regression equation, but what next? At first, the output is not what we expected. How will we get our expected output?

To do this, we have two more concepts called "Cost Function" and "Gradient Descent". Let's talk about the cost function first.

Cost Function

A cost function is a method of calculating the error made by the model when predicting outputs.

It's based on the difference between the predicted output and the actual output.

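One of them is the Mean Square Error, written here as half the sum of squared errors (the constant in front does not change where the minimum is)

J(\theta) = \frac{1}{2} \sum_{i=1}^{m} (h_\theta(x^i) - y^i)^2

and we want its minimum value

\min_\theta J(\theta)

Here, $x^i$ and $y^i$ are the input and actual output of the $i^{th}$ training example, and $m$ is the number of training examples.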
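As a rough sketch, the cost can be computed like this in Python. It assumes the squared-error form with the factor 1/2; the names `predict` and `cost` and the tiny dataset are invented for the example.

```python
def predict(theta, features):
    """h_theta(x) with theta[0] as the parameter for x_0 = 1."""
    return theta[0] + sum(t * x for t, x in zip(theta[1:], features))


def cost(theta, X, y):
    """Half the sum of squared differences between predictions and targets."""
    return sum((predict(theta, xi) - yi) ** 2 for xi, yi in zip(X, y)) / 2


# Hypothetical data generated from y = 1 + 2x
X = [[1.0], [2.0], [3.0]]
y = [3.0, 5.0, 7.0]

print(cost([1.0, 2.0], X, y))  # 0.0, these parameters fit the data exactly
print(cost([0.0, 2.0], X, y))  # 1.5, every prediction is off by 1
```

The better the parameters match the data, the smaller the cost, which is why we look for its minimum.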

Gradient Descent

It is an optimization algorithm commonly used to minimize the cost function. The core concept is based on differentiation: it finds a minimum by repeatedly moving in the direction of the negative gradient.

Before moving on to the algorithm itself, let's first look at a term called the learning rate.
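To see where the differentiation comes in, here is a short worked derivation. It assumes the squared-error cost $J(\theta) = \frac{1}{2}\sum_{i=1}^{m}(h_\theta(x^i) - y^i)^2$ over $m$ training examples, and writes $x_j^i$ for the $j^{th}$ feature of the $i^{th}$ example (this notation is an assumption of the sketch). Differentiating with respect to a single parameter $\theta_j$:

\frac{\partial J}{\partial \theta_j} = \frac{1}{2} \sum_{i=1}^{m} 2 \, (h_\theta(x^i) - y^i) \, \frac{\partial h_\theta(x^i)}{\partial \theta_j} = \sum_{i=1}^{m} (h_\theta(x^i) - y^i) \, x_j^i

The last step uses $\frac{\partial h_\theta(x^i)}{\partial \theta_j} = x_j^i$, because $h_\theta$ is just a weighted sum of the features. The factor 1/2 in the cost is there only to cancel the 2 that comes from the square.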

Learning Rate $(\alpha)$

The learning rate is the size of the step the gradient descent algorithm takes at each update.

A high learning rate may cover a large distance in less time, but it can overshoot the minimum, so the cost bounces back and forth around it or even grows instead of shrinking.

A low learning rate is safer, but it takes a lot of time to reach the minimum.
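A toy experiment can show the effect of the step size. The sketch below does not use linear regression at all: it minimizes the made-up function f(theta) = theta^2, whose derivative is 2 * theta and whose minimum is at 0. The function name `descend` and the three learning rates are picked only for illustration.

```python
def descend(alpha, steps=20, theta=1.0):
    """Run gradient descent on f(theta) = theta**2 and return the final theta."""
    for _ in range(steps):
        grad = 2 * theta              # derivative of theta**2
        theta = theta - alpha * grad  # step against the gradient
    return theta


for alpha in (0.01, 0.1, 1.1):
    print(alpha, descend(alpha))
# alpha = 0.01: still far from 0 after 20 steps (too slow)
# alpha = 0.1:  close to 0
# alpha = 1.1:  each step overshoots 0 and lands further away (diverges)
```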

Let's move on to the gradient descent algorithm.

\theta_j = \theta_j - \alpha \sum_{i=1}^{m} (h_\theta(x^i) - y^i) \, x_j^i

The above equation is the gradient descent update. Note that we subtract the derivative from our parameter (weight). Why do we do that? Suppose we added it instead: the algorithm would then move in the upward, or positive, direction, searching for a maximum instead of a minimum, and we would get a huge error in our model, which is a bad thing. Now let's break it down a little bit.

We know $\theta_j$ as the $j^{th}$ parameter.

$\alpha$ is the learning rate: how big a step we want to take at a time.

$h_\theta(x^i) - y^i$ is the difference between the $i^{th}$ predicted output and the $i^{th}$ actual output.

$x_j^i$ is the $j^{th}$ input value of the $i^{th}$ example, and $m$ is the number of examples.
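Putting the pieces together, the update rule can be sketched in plain Python as below. This is a minimal illustration under stated assumptions, not a tuned implementation: the names `predict` and `gradient_descent`, the learning rate, the iteration count and the toy dataset (generated from y = 1 + 2x) are all invented for the example.

```python
def predict(theta, features):
    """h_theta(x) with theta[0] as the parameter for x_0 = 1."""
    return theta[0] + sum(t * x for t, x in zip(theta[1:], features))


def gradient_descent(X, y, alpha=0.01, iterations=5000):
    """Batch gradient descent: every update uses all training examples."""
    n_features = len(X[0])
    theta = [0.0] * (n_features + 1)
    for _ in range(iterations):
        # Prediction error for each example under the current parameters
        errors = [predict(theta, xi) - yi for xi, yi in zip(X, y)]
        # Gradient for theta_0 (x_0 = 1), then for each feature j
        grad = [sum(errors)]
        for j in range(n_features):
            grad.append(sum(e * xi[j] for e, xi in zip(errors, X)))
        # Update all parameters at once, stepping against the gradient
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta


X = [[1.0], [2.0], [3.0], [4.0]]
y = [3.0, 5.0, 7.0, 9.0]
theta = gradient_descent(X, y)
print([round(t, 3) for t in theta])  # close to [1.0, 2.0]
```

Note that all parameters are updated together from the same errors, so each update sees the predictions made before the step.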

This algorithm acts on the whole input dataset at every step, which means the training time is directly proportional to the size of the dataset. If the dataset is large, the time required is also large. For this we have its sibling, called "Stochastic gradient descent".
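For comparison, here is a rough sketch of the stochastic variant, which updates the parameters after looking at one example at a time instead of summing over the whole dataset. This post does not cover it in detail, so treat the names (`sgd`, `predict`), the learning rate, the number of epochs and the toy data as assumptions made for illustration.

```python
import random


def predict(theta, features):
    """h_theta(x) with theta[0] as the parameter for x_0 = 1."""
    return theta[0] + sum(t * x for t, x in zip(theta[1:], features))


def sgd(X, y, alpha=0.05, epochs=2000, seed=0):
    """Stochastic gradient descent: one update per training example."""
    rng = random.Random(seed)          # fixed seed so runs are repeatable
    theta = [0.0] * (len(X[0]) + 1)
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)             # visit examples in random order
        for i in order:
            error = predict(theta, X[i]) - y[i]
            theta[0] -= alpha * error  # x_0 = 1
            for j, xj in enumerate(X[i]):
                theta[j + 1] -= alpha * error * xj
    return theta


X = [[1.0], [2.0], [3.0], [4.0]]
y = [3.0, 5.0, 7.0, 9.0]
theta_sgd = sgd(X, y)
print([round(t, 3) for t in theta_sgd])  # close to [1.0, 2.0]
```

Each update here touches a single example, so the cost of one step no longer grows with the size of the dataset.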