<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Suraj J</title>
    <description>The latest articles on DEV Community by Suraj J (@suraj47).</description>
    <link>https://dev.to/suraj47</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F267039%2F727836b4-05e6-45d5-9e42-54c432bc32ae.png</url>
      <title>DEV Community: Suraj J</title>
      <link>https://dev.to/suraj47</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/suraj47"/>
    <language>en</language>
    <item>
      <title>MLS.1.b Gradient Descent in Linear Regression</title>
      <dc:creator>Suraj J</dc:creator>
      <pubDate>Tue, 12 Nov 2019 06:13:28 +0000</pubDate>
      <link>https://dev.to/ml_scratch/mls-1-b-gradient-descent-in-linear-regression-53dl</link>
      <guid>https://dev.to/ml_scratch/mls-1-b-gradient-descent-in-linear-regression-53dl</guid>
      <description>&lt;h1&gt;
  
  
  Gradient Descent in Linear Regression
&lt;/h1&gt;

&lt;p&gt;Gradient Descent is a first-order optimization algorithm for finding the minimum of a function. It finds a (local) minimum by repeatedly moving in the direction of steepest descent (downhill). This lets us update the parameters of the model (weights and bias) more accurately. &lt;/p&gt;

&lt;p&gt;To reach the local minimum we can't just jump straight to that point on the graph. We need to descend in small steps, check the slope, and take another step in the direction of descent, repeating until we reach the desired local minimum.&lt;/p&gt;

&lt;p&gt;The size of these small steps is called the learning rate. A very small learning rate gives more precision but is very time consuming, while a large learning rate may overshoot and miss the minimum. A common strategy is to use a higher learning rate while the slope of the curve is steep, and switch to smaller learning rates once the slope starts to flatten (less time and more precision).&lt;/p&gt;
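&lt;p&gt;As a toy illustration (not from the article) of how the learning rate trades speed against stability, here is a minimal sketch that minimises f(x) = x&lt;sup&gt;2&lt;/sup&gt; with three different rates; the function, step counts, and rates are all illustrative choices:&lt;/p&gt;

```python
def descend(lr, n_steps=50, x0=5.0):
    """Run gradient descent on f(x) = x**2 (gradient 2*x) and return the final x."""
    x = x0
    for _ in range(n_steps):
        x -= lr * 2 * x  # step against the gradient of f(x) = x**2
    return x

# A tiny rate converges slowly, a moderate rate converges fast,
# and a too-large rate (here > 1.0) diverges by overshooting.
for lr in (0.01, 0.1, 1.1):
    print(lr, descend(lr))
```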

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Jy3knlqQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/q2iywvf7yjidq4gtht97.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Jy3knlqQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/q2iywvf7yjidq4gtht97.png" alt="Gradient descending over a slope"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;
&lt;center&gt;&lt;small&gt;Gradient descending over a slope&lt;/small&gt;&lt;/center&gt;

&lt;p&gt;The cost function helps us evaluate how well our model is predicting. It is a &lt;em&gt;&lt;strong&gt;loss function&lt;/strong&gt;&lt;/em&gt; with its own curve over the parameters (weights and bias), and the slope of that curve tells us how to update the parameters. The lower the cost, the better the model's predictions.&lt;/p&gt;

&lt;p&gt;In the training phase we compute the model's prediction for each sample to see how much it deviates from the given output. Then, in the second phase, we calculate the cost from those errors using the cost (mean squared error) formula.&lt;/p&gt;


&lt;center&gt;&lt;strong&gt;&lt;br&gt;
&lt;code&gt;y_hat = w * x&lt;sub&gt;i&lt;/sub&gt; + b&lt;/code&gt;

&lt;p&gt;&lt;code&gt;cost = (1/N) * ∑(y&lt;sub&gt;i&lt;/sub&gt; − y_hat)&lt;sup&gt;2&lt;/sup&gt;  {i from 1 to N}&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;&lt;/strong&gt;&lt;/center&gt;
&lt;br&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_iters&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;#Training phase 
&lt;/span&gt;    &lt;span class="n"&gt;y_train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bias&lt;/span&gt;

    &lt;span class="c1"&gt;#Cost error calculating Phase
&lt;/span&gt;    &lt;span class="n"&gt;cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;n_samples&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;y_train&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;costs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;Now we update the weights and bias to decrease the error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="c1"&gt;#Updating the weight and bias derivatives
&lt;/span&gt;    &lt;span class="n"&gt;Delta_w&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;n_samples&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_hat&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;Delta_b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;n_samples&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;y_hat&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; 

    &lt;span class="c1"&gt;#Updating weights
&lt;/span&gt;    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;learn_rate&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;Delta_w&lt;/span&gt;
    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bias&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bias&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;learn_rate&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;Delta_b&lt;/span&gt;

    &lt;span class="c1"&gt;# end of loop
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
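&lt;p&gt;As a quick sanity check (a sketch, not part of the original article; the toy data and variable names are illustrative), the analytic gradient used above, (2/N)·Xᵀ(y_hat − y), can be compared against a finite-difference estimate of the cost:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 1))                  # toy data, purely illustrative
y = 3.0 * X + 1.0 + rng.normal(scale=0.1, size=(20, 1))
w, b = np.zeros((1, 1)), 0.0
n_samples = X.shape[0]

def cost(w, b):
    y_hat = X @ w + b
    return (1 / n_samples) * np.sum((y_hat - y) ** 2)

# Analytic gradient, as in the update step above
y_hat = X @ w + b
Delta_w = (2 / n_samples) * X.T @ (y_hat - y)

# Central finite difference: (cost(w + h) - cost(w - h)) / (2h)
h = 1e-6
e = np.zeros_like(w)
e[0, 0] = h
numeric = (cost(w + e, b) - cost(w - e, b)) / (2 * h)
print(Delta_w[0, 0], numeric)   # the two estimates should agree closely
```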



&lt;p&gt;Plotting the cost function against the number of iterations:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kZTBKKrx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/1h7gmh057q6gqx3jq7v1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kZTBKKrx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/1h7gmh057q6gqx3jq7v1.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;small&gt;&lt;center&gt;Cost against iterations&lt;/center&gt;&lt;/small&gt;&lt;/p&gt;

&lt;p&gt;Above is the cost function curve against the number of iterations. As the number of iterations (steps) increases, the cost drops drastically and approaches zero, meaning the minimum is nearby. We repeat the above updates until the error becomes &lt;em&gt;negligible&lt;/em&gt; or the minimum is reached.&lt;/p&gt;
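&lt;p&gt;One simple way to implement that stopping rule (a sketch, not from the article's class; the tolerance parameter &lt;code&gt;tol&lt;/code&gt; and the toy data are assumed additions) is to stop once consecutive costs barely change:&lt;/p&gt;

```python
import numpy as np

def gradient_descent_until_converged(X, y, learn_rate=0.01, tol=1e-8, max_iters=10000):
    """Plain gradient descent that stops once the cost change is negligible.

    `tol` (the convergence tolerance) is an illustrative addition, not part
    of the article's class, which runs for a fixed n_iters instead.
    """
    n_samples, n_features = X.shape
    w = np.zeros((n_features, 1))
    b = 0.0
    prev_cost = np.inf
    for i in range(max_iters):
        y_hat = X @ w + b
        cost = (1 / n_samples) * np.sum((y_hat - y) ** 2)
        if prev_cost - cost < tol:   # error change is negligible: stop early
            break
        prev_cost = cost
        w -= learn_rate * (2 / n_samples) * X.T @ (y_hat - y)
        b -= learn_rate * (2 / n_samples) * np.sum(y_hat - y)
    return w, b, i

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 1))
y = 2.0 * X + 0.5                    # noiseless line: w=2.0, b=0.5
w, b, iters = gradient_descent_until_converged(X, y)
print(w[0, 0], b, iters)             # converges near w=2.0, b=0.5
```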
&lt;h2&gt;
  
  
  Source code from Scratch
&lt;/h2&gt;


&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LinearModel&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="s"&gt;"""
    Linear Regression Model Class
    """&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;pass&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;gradient_descent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;learn_rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_iters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="s"&gt;"""
        Trains a linear regression model using gradient descent
        """&lt;/span&gt;
        &lt;span class="n"&gt;n_samples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_features&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bias&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prev_weights&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prev_bias&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;
        &lt;span class="n"&gt;costs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_iters&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="s"&gt;""""
            Training Phase
            """&lt;/span&gt;
            &lt;span class="n"&gt;y_hat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bias&lt;/span&gt;
            &lt;span class="s"&gt;"""
            Cost error Phase
            """&lt;/span&gt;
            &lt;span class="n"&gt;cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;n_samples&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;y_hat&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;costs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="s"&gt;"""
            Verbose: Description of cost at each iteration
            """&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Cost at iteration {0}: {1}"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="s"&gt;"""
            Updating the derivative
            """&lt;/span&gt;
            &lt;span class="n"&gt;Delta_w&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;n_samples&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_hat&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="n"&gt;Delta_b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;n_samples&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;y_hat&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; 

            &lt;span class="s"&gt;""""
            Updating weights and bias
            """&lt;/span&gt;
            &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;learn_rate&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;Delta_w&lt;/span&gt;
            &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bias&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bias&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;learn_rate&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;Delta_b&lt;/span&gt;

            &lt;span class="s"&gt;"""
            Save the weights for visualisation
            """&lt;/span&gt;
            &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prev_weights&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prev_bias&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bias&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bias&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;costs&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="s"&gt;"""
        Predicting the values by using Linear Model
        """&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bias&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# We have created our Linear Model class. Now we need to create and load our model.
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LinearModel&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;w_trained&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b_trained&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;costs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gradient_descent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;learn_rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.005&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_iters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OaW8XTUm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/ya2x6ov129op3xc9v55g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OaW8XTUm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/ya2x6ov129op3xc9v55g.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;visualize_training&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="s"&gt;"""
        Visualizing the line against the dataset        
        """&lt;/span&gt;

        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prev_weights&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prev_weights&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'red'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


        &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;animate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line_data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;line_data&lt;/span&gt;
            &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_ydata&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# update the data
&lt;/span&gt;            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

        &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

        &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_next_weight_and_bias&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prev_weights&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
                &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prev_weights&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prev_bias&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;animation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FuncAnimation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;animate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;get_next_weight_and_bias&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;init_func&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;init&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;35&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;blit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;





&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Visualization of training phase to get the best fit line
&lt;/span&gt;&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subplots&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;ani&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;visualize_training&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--b0h3sKUR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/6km0j9dyj36yhhbz85w0.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--b0h3sKUR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/6km0j9dyj36yhhbz85w0.gif" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Prediction Phase to test our model  
&lt;/span&gt;&lt;span class="n"&gt;n_samples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;
&lt;span class="n"&gt;n_samples_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;

&lt;span class="n"&gt;y_p_train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;y_p_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;error_train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;n_samples&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;y_p_train&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;error_test&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;n_samples_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;y_p_test&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Error on training set: {}"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;error_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Error on test set: {}"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;error_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--msoial9p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/vjgmluaf12ffqou8x48c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--msoial9p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/vjgmluaf12ffqou8x48c.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Plotting predicted best fit line
&lt;/span&gt;&lt;span class="n"&gt;fig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_p_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"x"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"y"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8BclMOb9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/wiybp7ij6p976921wm91.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8BclMOb9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/wiybp7ij6p976921wm91.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;center&gt;&lt;small&gt;Predicted Output&lt;/small&gt;&lt;/center&gt;

&lt;blockquote&gt;
&lt;p&gt;Check out the full source code for &lt;a href="https://github.com/ML-Scratch/ML_Code_From_Scratch/blob/master/MLS.1.Linear%20Regression/LinearModel_Gradient_Descent.ipynb"&gt;Gradient Descent on GitHub&lt;/a&gt;&lt;br&gt;
and also check out the other approaches in &lt;a href="https://github.com/ML-Scratch/ML_Code_From_Scratch/tree/master/MLS.1.Linear%20Regression"&gt;Linear Regression by ML-Scratch&lt;/a&gt;&lt;/p&gt;


&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Contributors
&lt;/h2&gt;

&lt;p&gt;This series is made possible by help from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pranav (&lt;a class="comment-mentioned-user" href="https://dev.to/devarakondapranav"&gt;@devarakondapranav&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Ram (&lt;a class="comment-mentioned-user" href="https://dev.to/r0mflip"&gt;@r0mflip&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Devika (&lt;a class="comment-mentioned-user" href="https://dev.to/devikamadupu1"&gt;@devikamadupu1&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Pratyusha (&lt;a class="comment-mentioned-user" href="https://dev.to/prathyushakallepu"&gt;@prathyushakallepu&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Pranay (&lt;a class="comment-mentioned-user" href="https://dev.to/pranay9866"&gt;@pranay9866&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Subhasri (&lt;a class="comment-mentioned-user" href="https://dev.to/subhasrir"&gt;@subhasrir&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Laxman (&lt;a class="comment-mentioned-user" href="https://dev.to/lmn"&gt;@lmn&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Vaishnavi (&lt;a class="comment-mentioned-user" href="https://dev.to/vaishnavipulluri"&gt;@vaishnavipulluri&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Suraj (&lt;a class="comment-mentioned-user" href="https://dev.to/suraj47"&gt;@suraj47&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>gradientdescent</category>
      <category>python</category>
      <category>linearregression</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>MLS.1.a Concepts for Linear Regression</title>
      <dc:creator>Suraj J</dc:creator>
      <pubDate>Tue, 12 Nov 2019 06:13:16 +0000</pubDate>
      <link>https://dev.to/ml_scratch/mls-1-a-concepts-for-linear-regression-1n9f</link>
      <guid>https://dev.to/ml_scratch/mls-1-a-concepts-for-linear-regression-1n9f</guid>
      <description>&lt;p&gt;The idea behind simple linear regression is to "fit" the observations of two variables into a linear relationship between them. Graphically, the task is to draw the line that is "best-fitting" or "closest" to the points.&lt;/p&gt;

&lt;p&gt;The equation of a straight line is written as &lt;strong&gt;&lt;code&gt;y = mx + b&lt;/code&gt;&lt;/strong&gt;, where &lt;strong&gt;&lt;code&gt;m&lt;/code&gt;&lt;/strong&gt; is the slope (gradient) and &lt;strong&gt;&lt;code&gt;b&lt;/code&gt;&lt;/strong&gt; is the y-intercept (the bias, where the line crosses the Y axis). In calculating the slope and y-intercept, we use some of the mathematical concepts explained below:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Mean&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This term is used to describe properties of statistical distributions. It is determined by adding all the data points in a population and then dividing the total by the number of points. The resulting number is known as the mean or the average.&lt;br&gt;
 &lt;/p&gt;
&lt;center&gt;&lt;strong&gt;&lt;code&gt;x̄ = Sum of observations / number of observations&lt;/code&gt;&lt;/strong&gt;&lt;/center&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Variance&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Variance (&lt;strong&gt;&lt;code&gt;σ&lt;sup&gt;2&lt;/sup&gt;&lt;/code&gt;&lt;/strong&gt;) is a measurement of the spread between numbers in a data set. That is, it measures how far each number in the set is from the mean and therefore from every other number in the set.&lt;br&gt;
&lt;strong&gt;&lt;center&gt;&lt;code&gt;Variance = (1 / n) * Σ (xi − x̄)&lt;sup&gt;2&lt;/sup&gt;&lt;/code&gt;&lt;/center&gt;&lt;/strong&gt;&lt;br&gt;
Where:&lt;br&gt;
     &lt;code&gt;&lt;strong&gt;xi&lt;/strong&gt;&lt;/code&gt; = i&lt;sup&gt;th&lt;/sup&gt; data point&lt;br&gt;
     &lt;code&gt;&lt;strong&gt;x̄&lt;/strong&gt;&lt;/code&gt; = the mean of all data points&lt;br&gt;
     &lt;code&gt;&lt;strong&gt;n&lt;/strong&gt;&lt;/code&gt; = the number of data points&lt;/p&gt;
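&lt;p&gt;The mean and variance formulas above translate directly into a short sketch (a minimal NumPy example; the data points are made up purely for illustration):&lt;/p&gt;

```python
import numpy as np

# Hypothetical data points, used only for illustration
x = np.array([2.0, 4.0, 6.0, 8.0])
n = len(x)

# Mean: sum of observations / number of observations
x_bar = np.sum(x) / n

# Variance: (1 / n) * sum of (xi - x_bar)^2
variance = np.sum((x - x_bar) ** 2) / n

print(x_bar)     # 5.0
print(variance)  # 5.0
```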

&lt;h3&gt;
  
  
  &lt;strong&gt;Co-variance&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Covariance is a measure of how much two random variables vary together. It’s similar to variance, but where variance tells you how a single variable varies, covariance tells you how two variables vary together. The square root of variance is called the &lt;code&gt;Standard Deviation&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;
&lt;center&gt; &lt;strong&gt;&lt;code&gt;Cov(X,Y) = Σ (xi − μ) * (yi − ν) / (n − 1)&lt;/code&gt;&lt;/strong&gt; &lt;/center&gt;
&lt;br&gt;
Where&lt;br&gt;
     &lt;strong&gt;&lt;code&gt;xi&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;yi&lt;/code&gt;&lt;/strong&gt; are the i&lt;sup&gt;th&lt;/sup&gt; observations of the random variables &lt;strong&gt;&lt;code&gt;X&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;Y&lt;/code&gt;&lt;/strong&gt;&lt;br&gt;
     &lt;strong&gt;&lt;code&gt;μ = E(X)&lt;/code&gt;&lt;/strong&gt; is the expected value (the mean) of &lt;strong&gt;&lt;code&gt;X&lt;/code&gt;&lt;/strong&gt;&lt;br&gt;
     &lt;strong&gt;&lt;code&gt;ν = E(Y)&lt;/code&gt;&lt;/strong&gt; is the expected value (the mean) of &lt;strong&gt;&lt;code&gt;Y&lt;/code&gt;&lt;/strong&gt;&lt;br&gt;
     &lt;strong&gt;&lt;code&gt;n&lt;/code&gt;&lt;/strong&gt; = the number of items in the data set
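&lt;p&gt;A minimal sketch of the covariance formula (the paired observations are invented for illustration; &lt;code&gt;np.cov&lt;/code&gt; uses the same &lt;code&gt;n − 1&lt;/code&gt; denominator, so it can serve as a cross-check):&lt;/p&gt;

```python
import numpy as np

# Hypothetical paired observations, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

mu = np.mean(x)  # E(X)
nu = np.mean(y)  # E(Y)

# Sample covariance: sum of (xi - mu) * (yi - nu) / (n - 1)
cov_xy = np.sum((x - mu) * (y - nu)) / (len(x) - 1)

print(cov_xy)  # same value as np.cov(x, y)[0, 1]
```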
&lt;h3&gt;
  
  
  &lt;strong&gt;Correlation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Correlation(r)&lt;/strong&gt; is a statistical technique that can show whether and how strongly pairs of variables are related.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1IrPEkCw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/70yu9ljtxsx0o45qs90j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1IrPEkCw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/70yu9ljtxsx0o45qs90j.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;
&lt;center&gt;&lt;small&gt;&lt;strong&gt;&lt;code&gt;Sx&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;Sy&lt;/code&gt;&lt;/strong&gt; = Standard deviation of &lt;code&gt;x&lt;/code&gt;, &lt;code&gt;y&lt;/code&gt;&lt;/small&gt;&lt;/center&gt;
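&lt;p&gt;Correlation can be sketched from the pieces defined above, as covariance divided by the product of the standard deviations (the data is made up for illustration; &lt;code&gt;np.corrcoef&lt;/code&gt; computes the same quantity):&lt;/p&gt;

```python
import numpy as np

# Hypothetical data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# r = Cov(X, Y) / (Sx * Sy), using sample statistics throughout
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)
r = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

print(r)  # identical to np.corrcoef(x, y)[0, 1]
```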
&lt;h3&gt;
  
  
  &lt;strong&gt;Root Mean Square Error (RMSE)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors). Residuals are a measure of how far the data points are from the regression line.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JIl_QuP5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/kgllofd5bfihljduw61m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JIl_QuP5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/kgllofd5bfihljduw61m.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Calculation of Slope and Bias&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The slope of the line is calculated as the change in y divided by change in x.&lt;/p&gt;


&lt;center&gt;&lt;strong&gt;&lt;code&gt;slope m = change in y / change in x&lt;/code&gt;&lt;/strong&gt;&lt;/center&gt;

&lt;p&gt;The y-intercept (or bias) can be calculated using the point-slope form of the line&lt;/p&gt;


&lt;center&gt;&lt;strong&gt;&lt;code&gt;y = m(x - x1) + y1&lt;/code&gt;&lt;/strong&gt;&lt;/center&gt;

&lt;p&gt;These values differ from what was actually in the training set, and if we plot this (x, y) line against the original graph, the straight line will be way off the original points. This difference between the actual points and the points on the straight line is the error. Ideally, we’d like a straight line where the error is minimized across all points. Error can be reduced in many mathematical ways; one such method is "Least Square Regression".&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Least Square Regression&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Least Square Regression is a method which minimizes the error in such a way that the sum of all square error is minimized.&lt;br&gt;
&lt;/p&gt;
&lt;center&gt;
&lt;strong&gt;&lt;br&gt;
&lt;code&gt;m = Σ ((x - x̄) * (y - ȳ)) / Σ (x - x̄)&lt;sup&gt;2&lt;/sup&gt;&lt;/code&gt;&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;(or)&lt;/em&gt;&lt;br&gt;
&lt;strong&gt;&lt;code&gt;m = r(Sy / Sx)&lt;/code&gt;&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;(and we get the y-intercept)&lt;/em&gt;&lt;br&gt;
&lt;strong&gt;&lt;code&gt;b = ȳ - m * x̄&lt;/code&gt;&lt;br&gt;
&lt;/strong&gt;
&lt;/center&gt;

&lt;p&gt;Where&lt;br&gt;
     &lt;code&gt;&lt;strong&gt;Sx&lt;/strong&gt;&lt;/code&gt; is standard deviation of &lt;code&gt;x&lt;/code&gt;&lt;br&gt;
     &lt;code&gt;&lt;strong&gt;Sy&lt;/strong&gt;&lt;/code&gt; is standard deviation of &lt;code&gt;y&lt;/code&gt;&lt;br&gt;
     &lt;strong&gt;&lt;code&gt;r&lt;/code&gt;&lt;/strong&gt; is correlation between &lt;code&gt;x&lt;/code&gt; and &lt;code&gt;y&lt;/code&gt;&lt;br&gt;
     &lt;strong&gt;&lt;code&gt;m&lt;/code&gt;&lt;/strong&gt; is slope&lt;br&gt;
     &lt;strong&gt;&lt;code&gt;b&lt;/code&gt;&lt;/strong&gt; is the y-intercept&lt;/p&gt;

&lt;p&gt;This method minimizes the sum of the squares of all error values: the lower the error, the smaller the overall deviation from the original points.&lt;/p&gt;
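&lt;p&gt;The least squares formulas for &lt;code&gt;m&lt;/code&gt; and &lt;code&gt;b&lt;/code&gt; translate directly into code (a sketch with invented, exactly linear data so the recovered slope and intercept are easy to verify):&lt;/p&gt;

```python
import numpy as np

# Invented, exactly linear data (y = 2x + 1) so the result is easy to verify
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

x_bar, y_bar = x.mean(), y.mean()

# m = sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)^2)
m = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)

# b = y_bar - m * x_bar
b = y_bar - m * x_bar

print(m, b)  # 2.0 1.0
```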

&lt;h3&gt;
  
  
  &lt;strong&gt;Cost Function&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The cost function calculates the square of the error for each example in the dataset, sums it up, and divides this value by the number of examples in the dataset (denoted by &lt;code&gt;m&lt;/code&gt;). This cost function helps in determining the best fit line. The cost function for two variables &lt;strong&gt;&lt;code&gt;θ0&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;θ1&lt;/code&gt;&lt;/strong&gt; is denoted by &lt;strong&gt;&lt;code&gt;J&lt;/code&gt;&lt;/strong&gt; and is given as follows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0Nw5xkCX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/9s039516h7vc4ty8i2me.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0Nw5xkCX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/9s039516h7vc4ty8i2me.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_kiMn6dP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/ymej6u66w2gueseg9x7a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_kiMn6dP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/ymej6u66w2gueseg9x7a.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, we have to make use of cost function to adjust our parameters  &lt;strong&gt;&lt;code&gt;θ0&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;θ1&lt;/code&gt;&lt;/strong&gt; such that they result in the least cost function value. We make use of a technique called Gradient Descent to minimize the cost function. &lt;/p&gt;
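&lt;p&gt;The cost function described above can be sketched as follows (the θ values and data are made up for illustration; the extra factor of 2 in the denominator is the convention used in the formula, which simplifies the derivative):&lt;/p&gt;

```python
import numpy as np

def cost(theta0, theta1, X, Y):
    # J = (1 / (2m)) * sum((theta0 + theta1 * X - Y)^2)
    m = len(X)
    predictions = theta0 + theta1 * X
    return np.sum((predictions - Y) ** 2) / (2 * m)

# Hypothetical data where Y is exactly 1 + 2 * X
X = np.array([1.0, 2.0, 3.0])
Y = np.array([3.0, 5.0, 7.0])

print(cost(1.0, 2.0, X, Y))  # 0.0 for the perfect parameters
print(cost(0.0, 0.0, X, Y))  # a much larger cost for a bad line
```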

&lt;h3&gt;
  
  
  &lt;strong&gt;Read On 📝&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://dev.to/ml_scratch/mls-1-b-gradient-descent-in-linear-regression-53dl"&gt;MLS.1.b Gradient Descent in Linear regression&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Contributors
&lt;/h2&gt;

&lt;p&gt;This series is made possible by help from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pranav (&lt;a class="comment-mentioned-user" href="https://dev.to/devarakondapranav"&gt;@devarakondapranav&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Ram (&lt;a class="comment-mentioned-user" href="https://dev.to/r0mflip"&gt;@r0mflip&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Devika (&lt;a class="comment-mentioned-user" href="https://dev.to/devikamadupu1"&gt;@devikamadupu1&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Pratyusha (&lt;a class="comment-mentioned-user" href="https://dev.to/prathyushakallepu"&gt;@prathyushakallepu&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Pranay (&lt;a class="comment-mentioned-user" href="https://dev.to/pranay9866"&gt;@pranay9866&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Subhasri (&lt;a class="comment-mentioned-user" href="https://dev.to/subhasrir"&gt;@subhasrir&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Laxman (&lt;a class="comment-mentioned-user" href="https://dev.to/lmn"&gt;@lmn&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Vaishnavi (&lt;a class="comment-mentioned-user" href="https://dev.to/vaishnavipulluri"&gt;@vaishnavipulluri&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Suraj (&lt;a class="comment-mentioned-user" href="https://dev.to/suraj47"&gt;@suraj47&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>algebra</category>
      <category>linearregression</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>MLS.1 Linear Regression</title>
      <dc:creator>Suraj J</dc:creator>
      <pubDate>Tue, 12 Nov 2019 06:12:55 +0000</pubDate>
      <link>https://dev.to/ml_scratch/mls-1-linear-regression-1eo3</link>
      <guid>https://dev.to/ml_scratch/mls-1-linear-regression-1eo3</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;The source code for the topics discussed in the post can be found at &lt;a href="https://github.com/ML-Scratch/ML_Code_From_Scratch"&gt;https://github.com/ML-Scratch/ML_Code_From_Scratch&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Linear regression is a very basic supervised learning model. It is used when there is a linear relationship between the feature vector and the target, or in simple terms between the input and the output we are trying to predict. Linear regression serves as the starting point for many machine learning enthusiasts, and understanding this model can greatly help in mastering the more complex models in ML.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;When should you use Linear Regression?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;As the name suggests, Linear Regression involves fitting the best fit straight line through the data. Consider a dataset consisting of information about used cars and the prices they were sold for. For example, it contains the number of kilometers each car traveled and the price it was sold for. As one might realize, there could be a linear relationship between the number of kilometers traveled and the selling price.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VhkHGO4J--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/odkr516ke6ufgpith1hj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VhkHGO4J--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/odkr516ke6ufgpith1hj.png" alt="Visualization of data traveled"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A visualisation of the data like the one in the above image makes it clear that a straight line can be fit for this kind of data. One should also note that in most cases it is impossible to fit a line that passes through all the points in the dataset. The best we can do is fit a straight line that passes close to most of the points, and we will see in the coming sections how to do so. Once we fit a line through this data, i.e. generate a line equation, we can start predicting prices by plugging the number of kilometers a car has traveled into the line equation.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Understanding the math&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The math behind the working of Linear Regression is not at all complicated. For simplicity let’s assume the dataset has only one feature, the number of kilometers traveled (let's call this X), and one column with the selling prices of these cars (let's call this Y).&lt;/p&gt;

&lt;p&gt;Our job is to create a line equation like &lt;strong&gt;&lt;code&gt;Y = mX + c&lt;/code&gt;&lt;/strong&gt; . When the value of &lt;strong&gt;&lt;code&gt;X&lt;/code&gt;&lt;/strong&gt; (i.e the number of kilometers traveled) from the dataset is plugged into this equation it should calculate  &lt;strong&gt;&lt;code&gt;Y&lt;/code&gt;&lt;/strong&gt;  (i.e the predicted selling price) that is either equal to the  selling price value from the dataset or some close enough value. &lt;/p&gt;

&lt;p&gt;As you might infer, the variables in the above line equation are &lt;strong&gt;&lt;code&gt;m&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;c&lt;/code&gt;&lt;/strong&gt;, which are nothing but the slope of the line and the &lt;strong&gt;&lt;code&gt;y&lt;/code&gt;&lt;/strong&gt; intercept of the line. Remember that &lt;strong&gt;&lt;code&gt;X&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;Y&lt;/code&gt;&lt;/strong&gt; are not the variables in our case, as they are nothing but constants from our dataset that we will use in creating the best fit line. &lt;/p&gt;

&lt;p&gt;So our job is now to find out the right &lt;strong&gt;&lt;code&gt;m&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;c&lt;/code&gt;&lt;/strong&gt; values so that we can make an ideal straight line that passes through most of the points in the dataset.  &lt;/p&gt;

&lt;p&gt;Let's modify the above equation slightly to &lt;strong&gt;&lt;code&gt;Y = θ0 + θ1X&lt;/code&gt;&lt;/strong&gt;&lt;br&gt;
Where &lt;strong&gt;&lt;code&gt;θ0 = c&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;θ1 = m&lt;/code&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How to decide if a line is good enough?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Now that we understand the line equation, how should we decide whether the line equation we are using is the best fit line or not? An obvious way to do this is to plot the line against the dataset and decide visually. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--x6g0I0Xg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/gb6mvfr4xh17c2ccjht4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--x6g0I0Xg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/gb6mvfr4xh17c2ccjht4.png" alt="Good and bad lines"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However this is not practically possible for huge datasets with a large number of features, which is often the case with real world datasets. Hence we use a simple mathematical formula called the cost function to decide if a given line is a good fit or a bad fit to the data. &lt;/p&gt;

&lt;p&gt;Consider the following mini dataset:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;S.no&lt;/th&gt;
&lt;th&gt;Kms traveled&lt;/th&gt;
&lt;th&gt;Price (in lakhs)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1000&lt;/td&gt;
&lt;td&gt;2.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2000&lt;/td&gt;
&lt;td&gt;1.7&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Suppose we start with random values for &lt;strong&gt;&lt;code&gt;θ0 = 10&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;θ1 = 20&lt;/code&gt;&lt;/strong&gt;. Let us plug in &lt;strong&gt;&lt;code&gt;X = 1000&lt;/code&gt;&lt;/strong&gt; as per the first example in the dataset.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;center&gt;
&lt;code&gt;Y = 10 + 20(1000)&lt;/code&gt;&lt;br&gt;&lt;code&gt;Y = 20010&lt;/code&gt;
&lt;/center&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The predicted value according to the above equation is 20010 rupees, whereas the selling price according to the dataset is 210000 rupees (2.1 lakhs). This is definitely a bad prediction. The magnitude of the badness of this prediction, or technically the &lt;strong&gt;error&lt;/strong&gt;, is the difference between the predicted value (denoted by &lt;code&gt;Ŷ&lt;/code&gt;) and the actual value (denoted by &lt;code&gt;Y&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;In this case the predicted value is &lt;strong&gt;&lt;code&gt;Ŷ = 20010&lt;/code&gt;&lt;/strong&gt;, whereas the actual value (the value from the dataset) is &lt;strong&gt;&lt;code&gt;Y = 210000&lt;/code&gt;&lt;/strong&gt;.&lt;/p&gt;
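&lt;p&gt;The arithmetic of this small example can be reproduced in a few lines (a sketch; the price is taken as 2.1 lakhs = 210000 rupees, matching the figure quoted above):&lt;/p&gt;

```python
# Worked example: theta0 = 10, theta1 = 20, X = 1000 km traveled
theta0, theta1 = 10, 20
X = 1000

Y_hat = theta0 + theta1 * X  # predicted selling price
print(Y_hat)                 # 20010

Y = 210000                   # actual price (2.1 lakhs of rupees)
error = Y_hat - Y
print(error)                 # -189990: a very bad prediction
```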

&lt;p&gt;The cost function for two variables &lt;strong&gt;&lt;code&gt;θ0&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;θ1&lt;/code&gt;&lt;/strong&gt; is denoted by &lt;strong&gt;&lt;code&gt;J&lt;/code&gt;&lt;/strong&gt; and is given as follows&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--w36hY7MF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/mryokvqs0p2bw040d7ek.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--w36hY7MF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/mryokvqs0p2bw040d7ek.png" alt="Cost function"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The cost function calculates the square of the error for each example in the dataset, sums it up and divides this value by the number of examples in the dataset (denoted by &lt;code&gt;m&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;This cost function helps in determining the best fit line. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; The division with 2 is to simplify calculations involving the first order differentials &lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Arriving at the best fit line&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Now that we have defined the cost function, we have to make use of it to adjust our parameters &lt;strong&gt;&lt;code&gt;θ0&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;θ1&lt;/code&gt;&lt;/strong&gt; such that they result in the least cost function value. We make use of a technique called Gradient Descent to minimize the value of the cost function. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CyVGamQr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/coda6mut2ch1zgrqamsa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CyVGamQr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/coda6mut2ch1zgrqamsa.png" alt="Derivation of Gradient Descent"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;
&lt;center&gt;&lt;small&gt;Source: &lt;a href="https://mccormickml.com/2014/03/04/gradient-descent-derivation/"&gt;https://mccormickml.com/2014/03/04/gradient-descent-derivation/&lt;/a&gt;&lt;/small&gt;&lt;/center&gt;

&lt;p&gt;Gradient descent makes small changes to the existing &lt;strong&gt;&lt;code&gt;θ0&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;θ1&lt;/code&gt;&lt;/strong&gt; values such that they result in progressively smaller cost function values. The changes to &lt;strong&gt;&lt;code&gt;θ0&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;θ1&lt;/code&gt;&lt;/strong&gt; are performed as follows. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_55fvKgJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/1onp5hst6fybupkrf6ww.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_55fvKgJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/1onp5hst6fybupkrf6ww.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Where &lt;strong&gt;&lt;code&gt;j = 0&lt;/code&gt;&lt;/strong&gt; or &lt;strong&gt;&lt;code&gt;j = 1&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;
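&lt;p&gt;One iteration of this update rule can be sketched as follows (a minimal illustration; the data, starting θ values, and learning rate α are all made up, and the derivative expressions come from differentiating the cost function):&lt;/p&gt;

```python
import numpy as np

# Hypothetical data (Y = 1 + 2 * X exactly) and made-up starting values
X = np.array([1.0, 2.0, 3.0])
Y = np.array([3.0, 5.0, 7.0])
theta0, theta1 = 0.0, 0.0
alpha = 0.1  # learning rate

m = len(X)
errors = (theta0 + theta1 * X) - Y

# Partial derivatives of the cost function J
grad0 = np.sum(errors) / m
grad1 = np.sum(errors * X) / m

# Simultaneous update of both parameters
theta0 = theta0 - alpha * grad0
theta1 = theta1 - alpha * grad1

print(theta0, theta1)
```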

&lt;h3&gt;
  
  
  Let's try to understand what this updating of &lt;strong&gt;&lt;code&gt;θ0&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;θ1&lt;/code&gt;&lt;/strong&gt; means
&lt;/h3&gt;

&lt;p&gt;The differential part of this equation determines whether we have to increment or decrement the value of &lt;strong&gt;&lt;code&gt;θj&lt;/code&gt;&lt;/strong&gt;. If this differential is a positive value then &lt;strong&gt;&lt;code&gt;θj&lt;/code&gt;&lt;/strong&gt; is decremented, and if this differential is a negative value then &lt;strong&gt;&lt;code&gt;θj&lt;/code&gt;&lt;/strong&gt; is incremented, as can be observed from the above equation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AosJ7vJ_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/dfn8x4568fbun8ap78h6.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AosJ7vJ_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/dfn8x4568fbun8ap78h6.jpg" alt="θ vs Cost function (J(θ))"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;
&lt;center&gt;&lt;small&gt;θ vs Cost function (J(θ))&lt;/small&gt;&lt;/center&gt;

&lt;p&gt;Now that we know whether to increment or decrement &lt;strong&gt;&lt;code&gt;θj&lt;/code&gt;&lt;/strong&gt;, next we have to determine by how much &lt;strong&gt;&lt;code&gt;θj&lt;/code&gt;&lt;/strong&gt; should be changed. This is what &lt;strong&gt;&lt;code&gt;α&lt;/code&gt;&lt;/strong&gt;, or the learning rate, indicates. The larger the &lt;strong&gt;&lt;code&gt;α&lt;/code&gt;&lt;/strong&gt; value, the larger the update to &lt;strong&gt;&lt;code&gt;θj&lt;/code&gt;&lt;/strong&gt;, and vice versa. The value of &lt;strong&gt;&lt;code&gt;α&lt;/code&gt;&lt;/strong&gt; should not be too small, as that results in very slow convergence to the best fit line, and it should not be too large, as we might overshoot the values of &lt;strong&gt;&lt;code&gt;θj&lt;/code&gt;&lt;/strong&gt; that result in the best fit line. &lt;/p&gt;

&lt;p&gt;One set of updates to &lt;strong&gt;&lt;code&gt;θj&lt;/code&gt;&lt;/strong&gt; is called an iteration of Gradient Descent. &lt;/p&gt;

&lt;p&gt;This update process is repeated until the cost function value remains largely unchanged. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--75WypS1o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/fw1xjvobzuxol2ckrqvp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--75WypS1o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/fw1xjvobzuxol2ckrqvp.png" alt="Cost function vs number of iterations"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After a sufficient number of iterations of gradient descent, we can visually check the performance of the line by plotting it against the values in the dataset. If everything goes right, you should have a pretty decent line. You can now use this line equation to make predictions for any given &lt;strong&gt;&lt;code&gt;X&lt;/code&gt;&lt;/strong&gt; value (or the number of kilometers traveled).&lt;/p&gt;
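&lt;p&gt;Putting the pieces together, the whole procedure can be sketched as a short loop (illustrative only: the data, learning rate, and iteration count are assumptions, with the data generated around the line &lt;code&gt;Y = 1 + 2X&lt;/code&gt; plus a little noise):&lt;/p&gt;

```python
import numpy as np

# Hypothetical data scattered around the line Y = 1 + 2 * X
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([3.1, 4.9, 7.2, 8.8, 11.0])

theta0, theta1 = 0.0, 0.0
alpha = 0.05  # learning rate (an assumption)
m = len(X)

for _ in range(5000):  # iterations of gradient descent
    errors = (theta0 + theta1 * X) - Y
    grad0 = np.sum(errors) / m
    grad1 = np.sum(errors * X) / m
    theta0 -= alpha * grad0  # simultaneous update
    theta1 -= alpha * grad1

print(theta0, theta1)  # close to the underlying intercept and slope
```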

&lt;h3&gt;
  
  
  &lt;strong&gt;Pros&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Space complexity is very low: it just needs to save the weights at the end of training. Hence it's a low latency algorithm&lt;/li&gt;
&lt;li&gt;It's very simple to understand&lt;/li&gt;
&lt;li&gt;Good interpretability&lt;/li&gt;
&lt;li&gt;Feature importance is generated at the time of model building&lt;/li&gt;
&lt;li&gt;With the help of the hyperparameter lambda, you can handle feature selection and hence achieve dimensionality reduction&lt;/li&gt;
&lt;li&gt;Small number of hyperparameters&lt;/li&gt;
&lt;li&gt;Can be regularized to avoid overfitting and this is intuitive &lt;/li&gt;
&lt;li&gt;Lasso regression can provide feature importances&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Cons&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The algorithm assumes the data (more precisely, the residuals) are normally distributed, which real-world data often are not&lt;/li&gt;
&lt;li&gt;Multi-collinearity should be removed before building the model.&lt;/li&gt;
&lt;li&gt;Sensitive to outliers. &lt;/li&gt;
&lt;li&gt;Input data needs to be scaled, and there are a range of ways to do this.&lt;/li&gt;
&lt;li&gt;May not work well when the hypothesis function is non-linear.&lt;/li&gt;
&lt;li&gt;A complex hypothesis function is really difficult to fit. This can be done by using quadratic and higher-order features, but the number of these grows rapidly with the number of original features and may become very computationally expensive.&lt;/li&gt;
&lt;li&gt;Prone to overfitting when a large number of features are present.&lt;/li&gt;
&lt;li&gt;May not handle irrelevant features well &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So far so good: we have covered an overview of Linear Regression. Our next post revolves around the math concepts involved in Linear Regression.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Read On 📝&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://dev.to/ml_scratch/mls-1-a-concepts-for-linear-regression-1n9f"&gt;MLS.1.a Concepts for Linear regression&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/ml_scratch/mls-1-b-gradient-descent-in-linear-regression-53dl"&gt;MLS.1.b Gradient Descent in Linear regression&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Contributors
&lt;/h2&gt;

&lt;p&gt;This series is made possible by help from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pranav (&lt;a class="comment-mentioned-user" href="https://dev.to/devarakondapranav"&gt;@devarakondapranav&lt;/a&gt;
)&lt;/li&gt;
&lt;li&gt;Ram (&lt;a class="comment-mentioned-user" href="https://dev.to/r0mflip"&gt;@r0mflip&lt;/a&gt;
)&lt;/li&gt;
&lt;li&gt;Devika (&lt;a class="comment-mentioned-user" href="https://dev.to/devikamadupu1"&gt;@devikamadupu1&lt;/a&gt;
)&lt;/li&gt;
&lt;li&gt;Pratyusha (&lt;a class="comment-mentioned-user" href="https://dev.to/prathyushakallepu"&gt;@prathyushakallepu&lt;/a&gt;
)&lt;/li&gt;
&lt;li&gt;Pranay (&lt;a class="comment-mentioned-user" href="https://dev.to/pranay9866"&gt;@pranay9866&lt;/a&gt;
)&lt;/li&gt;
&lt;li&gt;Subhasri (&lt;a class="comment-mentioned-user" href="https://dev.to/subhasrir"&gt;@subhasrir&lt;/a&gt;
)&lt;/li&gt;
&lt;li&gt;Laxman (&lt;a class="comment-mentioned-user" href="https://dev.to/lmn"&gt;@lmn&lt;/a&gt;
)&lt;/li&gt;
&lt;li&gt;Vaishnavi (&lt;a class="comment-mentioned-user" href="https://dev.to/vaishnavipulluri"&gt;@vaishnavipulluri&lt;/a&gt;
)&lt;/li&gt;
&lt;li&gt;Suraj (&lt;a class="comment-mentioned-user" href="https://dev.to/suraj47"&gt;@suraj47&lt;/a&gt;
)&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>scratch</category>
      <category>regression</category>
      <category>linear</category>
    </item>
    <item>
      <title>Introduction</title>
      <dc:creator>Suraj J</dc:creator>
      <pubDate>Tue, 12 Nov 2019 06:12:08 +0000</pubDate>
      <link>https://dev.to/ml_scratch/introduction-4di4</link>
      <guid>https://dev.to/ml_scratch/introduction-4di4</guid>
      <description>&lt;h3&gt;
  
  
  &lt;strong&gt;What is Machine Learning?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Machine Learning is an application of Artificial Intelligence(AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. This involves the task of learning from data with specific inputs to the machine.&lt;/p&gt;

&lt;p&gt;It’s important to understand what makes Machine Learning work and, thus, how it can be used in the future. This blog helps in understanding each concept of ML from the basics, along with the mathematics associated with it. Math concepts are an integral part of ML. &lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What is ML-Scratch?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;ML-Scratch is an organisation that focuses on teaching machine learning algorithms from the primitive level. &lt;/p&gt;

&lt;p&gt;We provide detailed explanations of different concepts, so that one can code from the start (from scratch, as we say) without using any imported functions.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Read On 📝&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://dev.to/ml_scratch/mls-1-linear-regression-1eo3"&gt;MLS.1 Linear Regression&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/ml_scratch/mls-1-a-concepts-for-linear-regression-1n9f"&gt;MLS.1.a Concepts for Linear regression&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/ml_scratch/mls-1-b-gradient-descent-in-linear-regression-53dl"&gt;MLS.1.b Gradient Descent in Linear regression&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
