DEV Community





This all started with a man named Francis Galton, who was investigating the relationship between the heights of fathers and their sons.

What he discovered was that a man's son tended to be roughly as tall as his father. However, Galton's breakthrough was noticing that the son's height tended to be closer to the overall average height of all people than the father's height was.

Galton named this phenomenon REGRESSION, as in

"A father's son's height tends to regress (or drift towards) the mean (average) height."

Before getting started with Linear Regression, let's first have a look at what Regression Analysis means.

Regression Analysis

Regression Analysis is a statistical tool that allows us to quantify the relationship between a particular variable and an outcome. It has the amazing capacity to isolate a statistical relationship that we care about, while taking into account other factors that might confuse the relationship. In other words, we can isolate the effect of one variable, while holding the effects of other variables constant.

For example:

We want to see how height affects weight. Formally speaking, we want to quantify the relationship between two variables, Height and Weight. Here, Height is the independent (controlled, explanatory) variable, while Weight is the dependent (outcome) variable. Regression analysis allows us to quantify this relationship. However, there can also be other factors that affect Weight, such as Sex (Female/Male), Age, Exercise, and Race. These are additional independent variables. Regression Analysis allows us to take them into account as well, which helps us identify the relationship between the variables we are interested in (weight and height) more accurately by controlling for the rest. Or we may simply want to see how more than one feature (independent variable) affects the dependent variable.

Linear Regression

Linear Regression is used when we are trying to understand the relationship between TWO variables, that is, what change happens in the dependent variable (Y) if we make changes to the independent variable (X).

  1. If the independent variable increases and so does the dependent variable, we say there is a positive relationship.
  2. If the independent variable increases but the dependent variable decreases, we say there is a negative relationship.

Linear regression is a method for finding the straight line that best fits a set of points. We take observations and plot them on a graph. Then, we try to find the straight line that passes as close as possible to all these points. This line is the Regression Line.

For instance, as a birthday gift, your Aunt Ruth gives you her cricket database, which records chirps per minute alongside the temperature, and asks you to learn a model to predict this relationship.

Using this data, you want to explore this relationship.
First, examine your data by plotting it:

[Figure: scatter plot of temperature against cricket chirps per minute]

As expected, the plot shows the temperature rising with the number of chirps. Is this relationship between chirps and temperature linear? Yes, you could draw a single straight line like the following to approximate this relationship:

[Figure: the same scatter plot with a single straight line drawn through the points]

All we're trying to do when we calculate our regression line is draw a line that is as close to every dot (observation) as possible. True, the line doesn't pass through every dot, but the line does clearly show the relationship between chirps and temperature. Using the equation for a line, you could write down this relationship as follows:

y=mx + b

This equation is also known as the Regression Equation. The result is not only a line but an equation describing the line. It also means that every observation can be explained as TEMPERATURE = b + m(CRICKET CHIRPS) + e, where e is a “residual” that catches the variation in temperature for each observation that is not explained by the number of chirps.


  • y is the temperature in Celsius—the value we're trying to predict.
  • m is the slope of the line. It describes the “best” linear relationship between temperature and cricket chirps for this sample.
  • x is the number of chirps per minute—the value of our input feature.
  • b is the y-intercept (the value of y when x = 0).
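To make the residual idea concrete, here is a minimal Python sketch. The chirp and temperature numbers, as well as the slope and intercept, are made up for illustration (they are not fitted and do not come from Aunt Ruth's database), and `line` is just a hypothetical helper name.

```python
# Hypothetical observations: (chirps per minute, temperature in Celsius).
observations = [(80, 18.0), (100, 21.5), (120, 24.0), (140, 28.5)]

# Hand-picked slope and intercept, for illustration only (not fitted).
m = 0.17
b = 4.0


def line(x, m, b):
    """Return the point on the line y = m*x + b for input x."""
    return m * x + b


# e = observed temperature minus the value on the line.
residuals = [temperature - line(chirps, m, b) for chirps, temperature in observations]

for (chirps, temperature), e in zip(observations, residuals):
    print(f"chirps={chirps} observed={temperature} on_line={line(chirps, m, b):.2f} e={e:+.2f}")
```

Each observation is then exactly the value on the line plus its own residual e, which is what the equation TEMPERATURE = b + m(CRICKET CHIRPS) + e says.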

By convention in machine learning, you'll write the equation for a model slightly differently:

y′ = b + w1x1


  • y′ is the predicted label (a desired output).
  • b is the bias (the y-intercept), sometimes referred to as w0.
  • w1 is the weight of feature 1. Weight is the same concept as the "slope" m in the traditional equation of a line.
  • x1 is a feature (a known input).

To infer (predict) the temperature y′ for a new chirps-per-minute value x1, just substitute the x1 value into this model.
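As a small sketch of that substitution, assuming the model has the form y′ = b + w1·x1 suggested by the definitions above, inference is a one-line function. The weight and bias values here are hypothetical, not learned from real data.

```python
def predict(x1, w1, b):
    """Predicted label y' for a single feature x1, weight w1 and bias b."""
    return b + w1 * x1


# Hypothetical parameters (assumed for illustration, not learned).
w1 = 0.17
b = 4.0

# Predict the temperature for a new chirps-per-minute reading.
new_chirps = 110
print(predict(new_chirps, w1, b))
```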

Understanding/Interpreting the Regression Equation

Say we got the following Regression Equation for the relationship between weight and height:

WEIGHT = –135 + (4.5) × HEIGHT IN INCHES

b = –135. This is the y-intercept, which has no particular meaning on its own. (If you interpret it literally, a person who measures zero inches would weigh negative 135 pounds; obviously this is nonsense on several levels.) This figure (–135) is also known as the constant, because it is the starting point for calculating the weight of all observations in the study.

The slope, which is 4.5 here, is also known as the Regression Coefficient, or in statistics jargon, “the coefficient on height,” because it gives us the best estimate of the relationship between height and weight. The regression coefficient has a convenient interpretation: a one-unit increase in the independent variable (height) is associated with an increase of 4.5 units in the dependent variable (weight). For our data sample, this means that a 1-inch increase in height is associated with a 4.5 pound increase in weight. Thus, if we had no other information, our best guess for the weight of a person who is 5 feet 10 inches tall (70 inches) in the Changing Lives study would be –135 + 4.5 (70) = 180 pounds.
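The arithmetic above can be checked with a few lines of Python. The function name `predicted_weight` is only an illustrative choice; the intercept and coefficient are the ones from the equation WEIGHT = –135 + (4.5) × HEIGHT IN INCHES.

```python
def predicted_weight(height_in_inches):
    """Weight in pounds from WEIGHT = -135 + 4.5 * HEIGHT IN INCHES."""
    intercept = -135.0
    coefficient_on_height = 4.5
    return intercept + coefficient_on_height * height_in_inches


# Best guess for someone who is 5 feet 10 inches (70 inches) tall.
print(predicted_weight(70))  # 180.0

# One extra inch of height is associated with 4.5 extra pounds.
print(predicted_weight(71) - predicted_weight(70))  # 4.5
```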

We want to fit a linear regression line. The question is, how do we decide which line is the best-fitting one? Several techniques are used to prepare a linear regression model:

  1. Simple Linear Regression
  2. Ordinary Least Squares
  3. Gradient Descent
  4. Regularization

Least Squares Method

In classical Linear Regression, we use the Least Squares Method, which fits the line by minimizing the sum of the squares of the residuals.

[Figure: data points in blue, the fitted line in black, and the residuals marked in red]

The residual for an observation is the difference between the observation (y-value) and the fitted line. In the above image, the residuals are marked by red lines: the difference between the true data points, marked in blue, and the fitted model line, in black. This video shows the behind-the-scenes calculation for the Least Squares method.

You can find details about each of these techniques in this blog.

We will be discussing and implementing Least Squares and Gradient Descent in a future blog.
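For a single feature, the least squares line has a well-known closed form: slope = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)², and intercept = ȳ − slope · x̄. Here is a minimal pure-Python sketch of that formula; the data points are hypothetical and `fit_least_squares` is a name chosen just for this example.

```python
def fit_least_squares(xs, ys):
    """Return (slope, intercept) of the line minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared x-deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]

slope, intercept = fit_least_squares(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

print("slope:", slope, "intercept:", intercept)
print("sum of squared residuals:", sum(r ** 2 for r in residuals))
```

A handy sanity check is that the residuals of a least squares line with an intercept sum to (approximately) zero.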

Goal with Linear Regression

Our goal with linear regression is to minimize the vertical distance (residuals) between all the data points and our line.

How do we minimize the residuals?

There are a lot of different ways to measure the total error that we minimize:

  1. Sum of Squared Errors
  2. Sum of absolute Errors
  3. etc.

All these methods have a general goal of minimizing the distance (residuals).

An important point: while training a model, we don't care about minimizing the loss (error) on one example (data instance) but about minimizing the loss (error) across the entire dataset.
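As a rough sketch of the two error measures listed above, computed across a whole (tiny, made-up) dataset rather than for a single example:

```python
def sum_squared_errors(actual, predicted):
    """Sum of squared residuals across the entire dataset."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))


def sum_absolute_errors(actual, predicted):
    """Sum of absolute residuals across the entire dataset."""
    return sum(abs(a - p) for a, p in zip(actual, predicted))


# Hypothetical true values and model predictions.
actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]

print(sum_squared_errors(actual, predicted))   # 5.0
print(sum_absolute_errors(actual, predicted))  # 3.0
```

Notice how squaring makes the one large miss (7 vs. 9) dominate the total, while the absolute error treats all misses in proportion to their size.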

Multiple Regression (Multivariate Regression)

Multiple regression is used when more than one explanatory (independent) variable is involved. We try to get a Regression Coefficient for each explanatory variable included in the regression equation.

For instance, when we include multiple variables in the regression equation, the analysis gives us an estimate of the linear association between each explanatory variable and the dependent variable while holding the other explanatory variables constant, or “controlling for” these other factors. Let’s stick with weight for a while. We’ve found an association between height and weight; we know there are other factors that can help to explain weight (age, sex, diet, exercise, and so on).

Will we still fit a line for Multivariate Regression?

No, we can't. A line describes the relationship between one independent variable and one dependent variable. If we add one more independent variable, the dimensionality of the problem increases. So, for the relationship between two independent variables and a dependent variable, we will be fitting a plane. Similarly, for the relationship between three independent variables and a dependent variable, we will be fitting a three-dimensional hyperplane. With each additional independent variable (feature), the dimensionality increases, and the fit becomes harder to visualize.

A more sophisticated regression equation might rely on multiple features, each having a separate weight (w1, w2, etc.). For example, a regression equation that relies on three features might look as follows:

y′ = b + w1x1 + w2x2 + w3x3

Is the methodology different for Multivariate Regression than for Linear Regression?

No, it remains the same as for linear regression. Only the number of independent variables (features) increases.

On which data can we apply Regression?

One way to understand this question is to ask what we need to do to the data before applying regression to it. This also helps us check whether Regression is the right choice of Machine Learning technique for this dataset.
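To illustrate that only the number of features changes, here is a minimal sketch of a prediction function that works for any number of features, assuming the model is the bias plus a weighted sum of the features. The weights and feature values are hypothetical.

```python
def predict(features, weights, bias):
    """y' = bias + w1*x1 + w2*x2 + ... for any number of features."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return bias + sum(w * x for w, x in zip(weights, features))


# Hypothetical weights for a model with three features.
weights = [2.0, -1.0, 0.5]
bias = 1.0

print(predict([3.0, 4.0, 2.0], weights, bias))  # 4.0
```

With a single feature and a single weight, the same function reduces to the simple linear regression prediction.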

1. Linear Assumption

Make sure that the relationship between your dependent variable and independent variable is linear. This may be obvious, but it is good to remember when you have a lot of attributes.

What exactly is meant by a linear relation?

  1. A linear relationship between variables can generally be represented and explained by a straight line on a scatter plot.

  2. When one variable increases or decreases, the other variable also increases or decreases at a roughly constant rate.

To find out more about relationships between variables, look into the following two good resources:

  1. Relationship b/w variables
  2. Linear Vs Non-Linear, Monotonic relations
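One quick numeric check to go along with a scatter plot is the Pearson correlation coefficient, which is close to +1 or -1 when the points lie near a straight line. Below is a minimal pure-Python sketch on made-up data; `pearson_correlation` is written out by hand here just to show the calculation.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
linear_ys = [2.0 * x + 1.0 for x in xs]   # exactly linear in x
curved_ys = [(x - 3.0) ** 2 for x in xs]  # U-shaped, clearly not linear

r_linear = pearson_correlation(xs, linear_ys)
r_curved = pearson_correlation(xs, curved_ys)
print(r_linear, r_curved)
```

This is only a rough guide: a curved but steadily increasing relationship can also produce a high correlation, so it is still worth looking at the plot.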

2. Remove Noise

Make sure there is no noise in the data.

Humans are prone to making mistakes when collecting data, and data collection instruments may be unreliable, resulting in dataset errors. The errors are referred to as noise.

There are two main types of noise.
1- Class Noise

If the class/label is not assigned correctly to an instance/example of the dataset, this is called Class Noise.

  • Contradictory instances: The same instance appears more than once in the dataset and is labeled with different class labels.
  • Mislabeled instances: Instances are labeled with the wrong class label. This type of error is common in situations where different classes have similar symptoms.
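Contradictory instances are the easier of the two to detect automatically. Here is a minimal sketch, assuming each row is a (features, label) pair; the rows and the helper name `find_contradictory_instances` are hypothetical.

```python
from collections import defaultdict


def find_contradictory_instances(rows):
    """Return {features: labels} for feature tuples that carry more than one distinct label."""
    labels_by_features = defaultdict(set)
    for features, label in rows:
        labels_by_features[tuple(features)].add(label)
    return {f: labels for f, labels in labels_by_features.items() if len(labels) > 1}


# Hypothetical dataset of (features, class label) pairs.
rows = [
    ((170, 65), "A"),
    ((170, 65), "B"),  # same features, different label: contradictory
    ((180, 80), "A"),
    ((180, 80), "A"),  # plain duplicate with the same label: not contradictory
]

contradictions = find_contradictory_instances(rows)
print(contradictions)
```

Mislabeled instances are harder to find this way, since nothing in the row itself reveals that the label is wrong.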

2- Attribute Noise

In contrast to class noise, attribute noise reflects erroneous values for one or more attributes (independent variables) of the dataset. Three types of attribute noise are distinguished: erroneous attribute values, missing or “don’t know” values, and incomplete or “don’t care” values.

This Research Paper is a fine read for learning more about noise.

3. Multicollinearity

Multicollinearity occurs when two or more independent variables (also known as predictors) are highly correlated with one another in a regression model.

The important thing to note is that this is the correlation between two independent variables, not between the dependent and independent variables.

What is the problem with Multicollinearity?


Consider a regression equation of the form Y = a0 + a1X1 + a2X2. Here X1 and X2 are the independent variables. The mathematical significance of a1 is that if we shift X1 by 1 unit, then Y shifts by a1 units, keeping X2 and other things constant. Similarly, if we shift X2 by one unit, Y shifts by a2 units, keeping X1 and other factors constant.

But when multicollinearity exists, our independent variables are highly correlated, so if we change X1 then X2 also changes, and we are not able to see their individual effects on Y, which is our research objective for a regression model.

“ This makes the effects of X1 on Y difficult to differentiate from the effects of X2 on Y. ”
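A simple way to spot this situation is to measure the correlation between the predictors themselves. The sketch below uses made-up values for X1 and X2 and also computes the variance inflation factor (VIF), which for the special case of exactly two predictors reduces to 1 / (1 − r²), where r is their Pearson correlation. The helper names are chosen just for this example.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def vif_two_predictors(x1, x2):
    """VIF for a model with exactly two predictors: 1 / (1 - r^2).

    Undefined (division by zero) if the predictors are perfectly correlated.
    """
    r = pearson_correlation(x1, x2)
    return 1.0 / (1.0 - r ** 2)


# Hypothetical predictors: x2 is roughly 2 * x1, so they move together.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1]

r = pearson_correlation(x1, x2)
vif = vif_two_predictors(x1, x2)
print("correlation between predictors:", r)
print("VIF:", vif)
```

A commonly cited rule of thumb treats a VIF above roughly 5 to 10 as a warning sign, though the exact cutoff is a judgment call.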

To learn more about multicollinearity, this blog is a good read.

4. Gaussian Distribution

Linear regression tends to make more reliable predictions if your input and output variables have a Gaussian distribution. You may get some benefit from using transforms (e.g. log or Box-Cox) on your variables to make their distribution more Gaussian-looking.
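As a small illustration of the log transform, the sketch below measures skewness (a symmetric distribution such as the Gaussian has skewness 0) before and after taking logs. The values are made up, and `skewness` is a hand-written helper using the moment-based definition.

```python
import math


def skewness(values):
    """Moment-based sample skewness: third central moment / (second central moment ** 1.5)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed variable (a few very large values).
skewed = [1.0, 2.0, 2.0, 3.0, 4.0, 8.0, 20.0, 50.0]

# The log transform requires strictly positive values.
logged = [math.log(v) for v in skewed]

print("skewness before:", skewness(skewed))
print("skewness after: ", skewness(logged))
```

On this sample the log transform pulls the long right tail in, so the skewness moves closer to zero.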

5. Rescale Distribution

Linear regression will often make more reliable predictions if you rescale input variables using standardization or normalization.
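Here is a minimal sketch of the two rescaling options on a made-up list of heights. Standardization subtracts the mean and divides by the standard deviation, while normalization (min-max scaling) maps the values onto the range 0 to 1. The function names are illustrative.

```python
import math


def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]


def min_max_normalize(values):
    """Rescale linearly so the smallest value becomes 0 and the largest becomes 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical heights in inches.
heights = [60.0, 65.0, 70.0, 75.0]

standardized = standardize(heights)
normalized = min_max_normalize(heights)
print(standardized)
print(normalized)
```

Both helpers assume the values are not all identical, since that would mean dividing by zero.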

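Remember from our scaling blog that standardization rescales the data to zero mean and unit variance. It does not change the shape of the distribution, so it will not make the data Gaussian on its own.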

R Square

Our basic regression analysis produces one other statistic of note: the R Square, which is a measure of the proportion of the total variation in the dependent variable that is explained by the regression equation.

Say we have a broad variation in our samples for weights. Many of the persons in the sample weigh more than the mean for the group overall; many weigh less. The R Square tells us how much of that variation around the mean is associated with differences in height alone. Suppose the answer is 0.25, or 25 percent. The more significant point may be that 75 percent of the variation in weight for our sample remains unexplained. There are clearly factors other than height that might help us understand the weights of the participants.

Remember, an R Square of zero means that our regression equation does no better than the mean at predicting the weight of any individual in the sample; an R Square of 1 means that the regression equation perfectly predicts the weight of every person in the sample.
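The two extreme cases can be reproduced with a short sketch that uses the usual definition R Square = 1 − (sum of squared residuals) / (total sum of squares around the mean). The weights below are made up.

```python
def r_squared(actual, predicted):
    """R Square = 1 - SS_res / SS_tot, with SS_tot measured around the mean of actual."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical weights in pounds.
actual = [150.0, 160.0, 170.0, 180.0]

perfect = list(actual)         # a model that predicts every weight exactly
mean_only = [165.0] * 4        # a model that always predicts the sample mean

print(r_squared(actual, perfect))    # 1.0
print(r_squared(actual, mean_only))  # 0.0
```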

What next?

We looked into the basics of Regression, which will act as a foundation when we do Regression Modeling in machine learning, which we will cover in the next blog.

David Out.
