<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: CoffeeBeans Consulting</title>
    <description>The latest articles on DEV Community by CoffeeBeans Consulting (@coffeebeans).</description>
    <link>https://dev.to/coffeebeans</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F5624%2F792d6229-1ffe-4c7c-b243-e8137c1a022b.png</url>
      <title>DEV Community: CoffeeBeans Consulting</title>
      <link>https://dev.to/coffeebeans</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/coffeebeans"/>
    <language>en</language>
    <item>
      <title>Extending Linear Programming to Statistical Learning</title>
      <dc:creator>Pratik Patre</dc:creator>
      <pubDate>Tue, 21 Jun 2022 06:48:13 +0000</pubDate>
      <link>https://dev.to/coffeebeans/extending-linear-programming-to-statistical-learning-24l</link>
      <guid>https://dev.to/coffeebeans/extending-linear-programming-to-statistical-learning-24l</guid>
      <description>&lt;p&gt;&lt;em&gt;&lt;strong&gt;What is linear programming?&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In simple words, it is a mathematical technique for finding the best outcome of a problem. Believe it or not, we have been using the idea of linear programming in everyday life (technically not programming in our minds, but the concept). Simple examples would be packing your bag for a trip, managing time in an exam, or buying a variety of chocolates within a set budget. At its core it is an optimisation problem that we try to solve, much like frequentist techniques such as Maximum Likelihood Estimation (MLE). Mathematically, we usually define decision variables and constraints and solve for the maximum or the minimum of an objective function.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;How do we formulate a Linear Programming problem?&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Let us look at the steps of defining a Linear Programming problem generically:&lt;/p&gt;

&lt;blockquote&gt;
&lt;ol&gt;
&lt;li&gt;Decide the unknowns, which act as decision variables&lt;/li&gt;
&lt;li&gt;Define the objective function&lt;/li&gt;
&lt;li&gt;Identify the constraints, i.e. the system of equalities or inequalities representing the restrictions on the decision variables&lt;/li&gt;
&lt;li&gt;Apply the non-negativity restriction on the decision variables&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;

&lt;p&gt;For a problem to be a linear programming problem, the objective function and the constraints must all be linear functions of the decision variables. If these conditions are satisfied, it is called a Linear Programming Problem.&lt;/p&gt;
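&lt;p&gt;As a toy illustration of these steps, here is a sketch of the chocolate-budget example from earlier. The prices are made-up numbers, and brute-force enumeration over integer counts stands in for a real LP solver; the point is only to show the decision variables, objective, budget constraint, and non-negativity in code form:&lt;/p&gt;

```python
# Decision variables: how many of each chocolate to buy.
# Objective: maximise the total number of chocolates.
# Constraint: total cost must stay within the budget.
# Non-negativity: counts start at 0 in each range().
prices = {"dark": 30, "milk": 20, "white": 25}  # hypothetical prices
budget = 100

best_count, best_combo = 0, {}
for dark in range(budget // prices["dark"] + 1):
    for milk in range(budget // prices["milk"] + 1):
        for white in range(budget // prices["white"] + 1):
            cost = dark * prices["dark"] + milk * prices["milk"] + white * prices["white"]
            count = dark + milk + white
            if budget >= cost and count > best_count:  # budget constraint
                best_count = count
                best_combo = {"dark": dark, "milk": milk, "white": white}
```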

&lt;h2&gt;
  
  
  &lt;strong&gt;Linear Programming in action&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Ordinary Least Squares (OLS) Basics&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
Let us see how linear programming can be combined with gradient descent and used in linear regression to solve an optimisation problem. Let's start simple: in linear regression we predict a quantitative response y on the basis of a single predictor variable x, assuming a linear relationship between x and y, i.e. &lt;strong&gt;y ~ w0 + w1*X&lt;/strong&gt;. We use the training data to produce the model coefficients w0 and w1, and can then predict yhat for X = x, i.e. &lt;strong&gt;yhat = w0 + w1*x&lt;/strong&gt;. Using the training data we find a line that most closely represents the data points. There are various ways to measure what "closely" means. We may, for instance, minimise the average distance (deviation) of the data points from the line, the sum of distances, the sum of squared distances, or the maximum distance of a data point from the line. The distance itself can be Euclidean distance, vertical distance, Manhattan distance (vertical + horizontal), or something else.&lt;br&gt;
Here we work with the vertical distance of a point from the line. In two-dimensional space, a general equation of a line with finite slope has the form &lt;strong&gt;y = w0 + w1*x&lt;/strong&gt;, where w1 and w0 are the slope and intercept respectively. For a point (p, q), the vertical distance of the point from the line &lt;strong&gt;y = w0 + w1*x&lt;/strong&gt; can be written as &lt;strong&gt;|q − w1*p − w0|&lt;/strong&gt;, which is nothing but the residual (e1). If we square it and sum over the n points in our data, we get the residual sum of squares, i.e. &lt;strong&gt;RSS = e1² + e2² + … + en²&lt;/strong&gt;.&lt;/p&gt;
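&lt;p&gt;The residual and RSS computation can be sketched in a few lines; the points and the candidate line below are made-up numbers purely for illustration:&lt;/p&gt;

```python
# Residual sum of squares for a candidate line y = w0 + w1*x.
xs = [0, 1, 2]
ys = [1, 4, 5]
w0, w1 = 1, 2  # candidate line y = 1 + 2x

# Each residual is q - w1*p - w0 for a point (p, q).
residuals = [y - (w0 + w1 * x) for x, y in zip(xs, ys)]
rss = sum(e ** 2 for e in residuals)  # e1^2 + e2^2 + ... + en^2
```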

&lt;h2&gt;
  
  
  &lt;strong&gt;Let's Start with an Example&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Defining the decision variable and objective function&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Now suppose we want to predict a house's price from one variable, the area of the house. This is our decision variable, which we denote as x. For a sample point, &lt;strong&gt;g(w) = w0 + w1*x&lt;/strong&gt; is the expected price of a house with area x. We take &lt;strong&gt;y = g(w) = w0 + w1*x&lt;/strong&gt; as the objective function, since this is the line we have to optimise to best fit the data.&lt;/p&gt;

&lt;p&gt;I will not go into the proof, but if we take the partial derivatives of y with respect to w0 and w1 and arrange them in vector form, we get the gradient of y. The mathematical expression is shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nJctICJl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kkduv5wq7rgob89bw3nj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nJctICJl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kkduv5wq7rgob89bw3nj.png" alt="Image description" width="880" height="604"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The cost function is the residual sum of squares, and the gradients of the RSS with respect to w0 and w1 are as shown by &lt;strong&gt;RSS(w0,w1)&lt;/strong&gt; above in vector form.&lt;/p&gt;
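&lt;p&gt;As a sanity check on the gradient expression, the sketch below compares the analytic gradient of the RSS (with errors defined as yhat − y, so the components are 2·sum(errors) and 2·sum(errors·x)) against a centred finite difference; the data points and the evaluation point are arbitrary made-up numbers:&lt;/p&gt;

```python
xs = [0.0, 1.0, 2.0]
ys = [1.0, 4.0, 5.0]

def rss(w0, w1):
    # Residual sum of squares for the line y = w0 + w1*x.
    return sum((w0 + w1 * x - y) ** 2 for x, y in zip(xs, ys))

def rss_grad(w0, w1):
    # Analytic gradient: (2*sum(errors), 2*sum(errors*x)), errors = yhat - y.
    errors = [(w0 + w1 * x) - y for x, y in zip(xs, ys)]
    return 2 * sum(errors), 2 * sum(e * x for e, x in zip(errors, xs))

# Centred finite differences at an arbitrary point (w0, w1) = (0.5, 1.5).
h = 1e-6
w0, w1 = 0.5, 1.5
num_g0 = (rss(w0 + h, w1) - rss(w0 - h, w1)) / (2 * h)
num_g1 = (rss(w0, w1 + h) - rss(w0, w1 - h)) / (2 * h)
g0, g1 = rss_grad(w0, w1)
```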

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Defining Constraints&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
So now we can define our constraints,&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We update w0 and w1 with the following equations:&lt;br&gt;
w0 = w0 - Learning rate * partial derivative of the RSS function with respect to w0&lt;br&gt;
For simplicity we can write this as &lt;strong&gt;w0 = w0 - 2*L*B&lt;/strong&gt;…………..(2)&lt;br&gt;
Where L is the learning rate and B = (sum of errors after prediction over all data points)&lt;/p&gt;

&lt;p&gt;w1 = w1 - Learning rate * partial derivative of the RSS function with respect to w1&lt;br&gt;
For simplicity we can write this as &lt;strong&gt;w1 = w1 - 2*L*C&lt;/strong&gt;..…………(3)&lt;br&gt;
Where C = (sum of (errors*x) after prediction over all data points)&lt;/p&gt;

&lt;p&gt;We repeat the above two steps until convergence. Technically, convergence means the magnitude of the gradient vector is 0, but in practice we stop once the magnitude drops below a tolerance value T.&lt;/p&gt;

&lt;p&gt;Since the two components of the gradient are proportional to B and C, the magnitude of the gradient vector is proportional to square root(B² + C²), and the stopping condition becomes&lt;br&gt;
&lt;strong&gt;Square root(B² + C²) &amp;lt; T&lt;/strong&gt; ………….(1a)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Defining the Non-negativity restriction&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
The non-negativity restriction: the area of the house must be greater than or equal to zero, i.e. x &amp;gt;= 0.&lt;/p&gt;

&lt;p&gt;So now that we have formulated our linear program with the help of gradient descent, let us solve for the optimum line.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Operationalising Linear Programming&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Let's consider the following data points&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sHsPyXUG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/av5jegqj6h05t6pnf7wz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sHsPyXUG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/av5jegqj6h05t6pnf7wz.png" alt="Image description" width="395" height="246"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We start by assuming intercept = w0 = 0 and slope = w1 = 0, and by setting the step size (learning rate) to 0.025 and the tolerance to 0.01.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Step 1:&lt;/strong&gt; We predict yhat based on &lt;strong&gt;yhat = w0 + w1*x&lt;/strong&gt;.&lt;br&gt;
We get the following predictions for the five data points: [0, 0, 0, 0, 0]&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2:&lt;/strong&gt; We find the errors after prediction by yhat - y&lt;br&gt;
We get the following errors [-1,-3,-7,-13,-21]&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3:&lt;/strong&gt; We update the intercept by summing the errors and inserting the sum into equation (2):&lt;br&gt;
0 - 2*0.025*sum([-1,-3,-7,-13,-21]) = 0 - 2*0.025*(-45) = 2.25&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4:&lt;/strong&gt; We update the slope by summing the errors multiplied by the respective house areas, as per equation (3):&lt;br&gt;
0 - 2*0.025*sum([0, 1, 2, 3, 4] * [-1, -3, -7, -13, -21]) = 0 - 2*0.025*(-140) = 7&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5:&lt;/strong&gt; We calculate the magnitude, which is the LHS of equation (1a), with B = -45 and C = -140:&lt;br&gt;
sqrt((-45)² + (-140)²) = 147.05&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 6:&lt;/strong&gt; We check equation (1a) to see if it converged. Since 147.05 is greater than 0.01 (the tolerance value set by us), we repeat the steps again but this time with the updated w0 and w1.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Since the purpose of this post is to see how linear programming ideas can be used in OLS, we will stop here. But feel free to code this up and find the optimum values of w0 and w1.&lt;/p&gt;
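&lt;p&gt;Taking up that invitation, here is a minimal pure-Python sketch of the full loop. The y values [1, 3, 7, 13, 21] are the ones implied by the errors computed in Step 2, and the updates follow the standard gradient-descent rule w = w − learning rate × gradient:&lt;/p&gt;

```python
import math

xs = [0, 1, 2, 3, 4]      # area of the house
ys = [1, 3, 7, 13, 21]    # price of the house (implied by Step 2)
lr, tol = 0.025, 0.01     # learning rate and tolerance
w0, w1 = 0.0, 0.0         # start with intercept = slope = 0

while True:
    errors = [(w0 + w1 * x) - y for x, y in zip(xs, ys)]  # yhat - y
    B = sum(errors)                                       # for equation (2)
    C = sum(e * x for e, x in zip(errors, xs))            # for equation (3)
    if tol > math.hypot(B, C):                            # equation (1a)
        break
    w0 -= 2 * lr * B      # equation (2)
    w1 -= 2 * lr * C      # equation (3)

# The loop settles near the least-squares line y = -1 + 5x.
```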

</description>
      <category>statistics</category>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Activation functions in Neural Networks | Deep Learning</title>
      <dc:creator>lakshaywadhwa7</dc:creator>
      <pubDate>Mon, 20 Jun 2022 07:32:58 +0000</pubDate>
      <link>https://dev.to/coffeebeans/activation-functions-in-neural-networks-deep-learning-58lo</link>
      <guid>https://dev.to/coffeebeans/activation-functions-in-neural-networks-deep-learning-58lo</guid>
<description>

&lt;h2&gt;
  
  
  What are activation functions and why do we need them?
&lt;/h2&gt;

&lt;p&gt;Activation functions are functions used in artificial neural networks to capture the complexities inside the data. A neural network without an activation function is just a simple regression model. The activation function applies a non-linear transformation to the input, making the network capable of learning and performing more complex tasks. We introduce non-linearity in each layer through activation functions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--reU35bJ1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/epmvkud8jc9mtwv8it4j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--reU35bJ1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/epmvkud8jc9mtwv8it4j.png" alt="Image description" width="880" height="601"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let us assume there are 3 hidden layers, 1 input and 1 output layer.&lt;br&gt;
W1-Weight matrix between Input layer and first hidden layer&lt;br&gt;
W2-Weight matrix between first hidden layer and second hidden layer&lt;br&gt;
W3-Weight matrix between second hidden layer and third hidden layer&lt;br&gt;
W4-Weight matrix between third hidden layer and output layer&lt;br&gt;
The equations below represent a feedforward neural network.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8OBXZBdq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qaqu6qo9sgf8scie5kww.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8OBXZBdq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qaqu6qo9sgf8scie5kww.png" alt="Image description" width="262" height="448"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we stack multiple layers, we can see the output layer as a function:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xh4lGG3W--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dbzne2x5g99x1u8kf1zr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xh4lGG3W--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dbzne2x5g99x1u8kf1zr.png" alt="Image description" width="462" height="137"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What are the ideal qualities of an activation function?&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Non-linearity:&lt;/strong&gt;&lt;br&gt;
The activation function introduces non-linearity into the network to capture the complex relations between the input features and the output variable/class.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Continuously differentiable:&lt;/strong&gt;&lt;br&gt;
The activation function needs to be differentiable, since neural networks are generally trained with gradient descent and other gradient-based optimisation methods.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Zero-centered:&lt;/strong&gt;&lt;br&gt;
Zero-centered activation functions keep the mean activation value around 0. This matters because convergence is usually faster on normalised data. Of the commonly used activations explained below, some are zero-centered and some are not; when an activation function is not zero-centered, we often add normalisation layers such as batch normalisation to mitigate the issue.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Low computational expense:&lt;/strong&gt;&lt;br&gt;
The activation function is evaluated in every layer of the network, many times over, so it should be cheap to compute.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Should not kill gradients:&lt;/strong&gt;&lt;br&gt;
Activation functions like sigmoid have a saturation problem: the output barely changes for large negative and large positive inputs.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FmaZcTwm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jdu9zna889v3iy9682wk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FmaZcTwm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jdu9zna889v3iy9682wk.png" alt="Image description" width="623" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The derivative of the sigmoid function becomes very small in those regions, which prevents the weights in the initial layers from being updated during backpropagation, so the network does not learn effectively. An activation function should ideally not suffer from this issue.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Most commonly used activation functions:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this section we will go over different activation functions.&lt;br&gt;
1.Sigmoid function-&lt;/p&gt;

&lt;p&gt;The sigmoid function is defined as:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zSzWCRjI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/d4rlnpfi4twn5p4g9xy7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zSzWCRjI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/d4rlnpfi4twn5p4g9xy7.png" alt="Image description" width="491" height="111"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Sigmoid function&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--O3p-7liZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/i217ygyo1m4rz6vul42b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--O3p-7liZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/i217ygyo1m4rz6vul42b.png" alt="Image description" width="623" height="419"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The sigmoid function is a type of activation function with a characteristic "S"-shaped curve; its domain is all real numbers and its output lies between 0 and 1. An undesirable property of the sigmoid function is that the activation of the neuron saturates at 0 or 1 when the input is either a large negative or a large positive value. It is also non-zero-centered, which makes neural network learning harder. In most cases it is better to use the tanh activation function instead of sigmoid.&lt;/p&gt;
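&lt;p&gt;A minimal sketch of the sigmoid and its derivative, showing the saturation discussed above:&lt;/p&gt;

```python
import math

def sigmoid(x):
    # Maps any real x into (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Derivative is s*(1 - s); largest at x = 0, tiny for large |x|.
    s = sigmoid(x)
    return s * (1.0 - s)

# sigmoid(0) is 0.5; sigmoid_grad(10) is already about 4.5e-5 (saturation).
```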

&lt;p&gt;2.Tanh function-&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Y92lbinF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/osbetfw3s8tdakhrarg3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Y92lbinF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/osbetfw3s8tdakhrarg3.png" alt="Image description" width="378" height="127"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;tanh function&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mXiHkae8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/i8alqlry59hlt8ftv4jn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mXiHkae8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/i8alqlry59hlt8ftv4jn.png" alt="Image description" width="880" height="405"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Tanh has one advantage over the sigmoid function: it is zero-centered, and its value is bounded between -1 and 1.&lt;/p&gt;

&lt;p&gt;3.RELU(Rectified Linear Unit)-&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BZdcjSmB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9hqh6t4ab2kq1ldlcbsd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BZdcjSmB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9hqh6t4ab2kq1ldlcbsd.png" alt="Image description" width="526" height="144"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;relu function&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zsJ2qmO6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ll2n45vg732wrtascg1d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zsJ2qmO6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ll2n45vg732wrtascg1d.png" alt="Image description" width="880" height="457"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;ReLU is one of the many non-zero-centered activation functions, yet despite this disadvantage it is widely used because of its advantages: it is computationally very inexpensive, does not saturate on the positive side, and does not cause the vanishing gradient problem. However, the ReLU function has no upper limit, so it can suffer from exploding activations; and for negative values its activation is 0, so it completely ignores nodes with negative values. Hence it suffers from the "dying ReLU" problem.&lt;br&gt;
Dying ReLU problem: during backpropagation, the weights and biases of some neurons are never updated, because the activation is zero for negative inputs. This can create dead neurons that never get activated.&lt;/p&gt;

&lt;p&gt;4.Leaky RELU-&lt;/p&gt;

&lt;p&gt;Leaky RELU is a type of activation function based on RELU function with a small slope for negative values instead of zero.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;leaky relu function&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4Vb2THEK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uy4mgkxs3xsjls5zpz4t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4Vb2THEK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uy4mgkxs3xsjls5zpz4t.png" alt="Image description" width="771" height="559"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MZaFJ3Np--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0yjjwbnlokqsddzk4ye5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MZaFJ3Np--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0yjjwbnlokqsddzk4ye5.png" alt="Image description" width="341" height="127"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, alpha is generally set to 0.01. Leaky ReLU solves the "dying ReLU" problem; alpha is kept small, well below 1, since the function would otherwise be almost linear.&lt;br&gt;
If alpha is instead made a learnable parameter for each neuron, we get PReLU, the parametrised ReLU function.&lt;/p&gt;

&lt;p&gt;5.ReLU6-&lt;/p&gt;

&lt;p&gt;This version of the ReLU function is basically a ReLU capped at 6 on the positive side.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--agKsIaRI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7no770zr0c36jsbup69r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--agKsIaRI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7no770zr0c36jsbup69r.png" alt="Image description" width="506" height="80"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;relu6 function&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Q3cYIjJ1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lcedvctvhzgaehlyp4ve.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Q3cYIjJ1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lcedvctvhzgaehlyp4ve.png" alt="Image description" width="880" height="656"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This caps the activation for large positive input values and hence prevents the activations, and with them the gradients, from growing towards infinity.&lt;/p&gt;
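&lt;p&gt;A sketch of ReLU6:&lt;/p&gt;

```python
def relu6(x):
    # ReLU clipped at 6 to keep activations bounded.
    return min(max(0.0, x), 6.0)
```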

&lt;p&gt;6.Exponential Linear Units (ELUs) Function-&lt;/p&gt;

&lt;p&gt;Exponential Linear Unit is also a version of ReLU that modifies the slope of the negative part of the function.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;exponential linear unit&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ioJ9Wj7n--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/th0gvx7082p0d88cry91.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ioJ9Wj7n--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/th0gvx7082p0d88cry91.png" alt="Image description" width="625" height="207"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This activation function also avoids the dead-ReLU problem, but it can still suffer from exploding activations, since there is no constraint on the output for large positive values.&lt;/p&gt;
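&lt;p&gt;A sketch of ELU with alpha = 1:&lt;/p&gt;

```python
import math

def elu(x, alpha=1.0):
    # Identity for positive x; smooth exponential approach to -alpha
    # for negative x, so the negative side never goes exactly flat at a kink.
    return x if x > 0 else alpha * (math.exp(x) - 1.0)
```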

&lt;p&gt;7.Softmax activation function-&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;softmax activation function&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AdCQ28aN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bvc60jyh1gsvy6ys7bav.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AdCQ28aN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bvc60jyh1gsvy6ys7bav.png" alt="Image description" width="638" height="254"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
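&lt;p&gt;The formula above can be sketched as follows; subtracting the maximum before exponentiating is a standard numerical-stability trick (not shown in the formula) that avoids overflow:&lt;/p&gt;

```python
import math

def softmax(zs):
    # Normalises a list of scores into probabilities that sum to 1.
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]
```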

&lt;p&gt;It is often used in the last layer of a neural network to normalise the output into a probability value per class, which tells us how likely the output is to belong to each class given the inputs. It is popularly used for multi-class classification problems.&lt;/p&gt;

&lt;p&gt;I hope you enjoyed reading this. I have tried to cover many of the activation functions commonly used in neural networks. If you find any mistake, please feel free to point it out and the blog will be corrected.&lt;/p&gt;

</description>
      <category>deeplearning</category>
      <category>statistics</category>
      <category>machinelearning</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
