DEV Community



Correlation for feature selection

Correlation measures the degree to which two phenomena are related to one another. For example, there is a correlation between summer temperatures and ice cream sales. When one goes up, so does the other. Two variables are positively correlated if a change in one is associated with a change in the other in the same direction, such as the relationship between height and weight. Taller people weigh more (on average); shorter people weigh less. A correlation is negative if a positive change in one variable is associated with a negative change in the other, such as the relationship between exercise and weight.

Correlation Range and interpretation

The power of correlation as a statistical tool is that we can encapsulate an association between two variables in a single descriptive statistic: the correlation coefficient. One especially attractive characteristic of the correlation coefficient is that it is a single number ranging from –1 to 1. A correlation of 1, often described as perfect correlation, means that every change in one variable is associated with an equivalent change in the other variable in the same direction. A correlation of –1, or perfect negative correlation, means that every change in one variable is associated with an equivalent change in the other variable in the opposite direction. The closer the correlation coefficient is to 1 or –1, the stronger the association; the closer it is to 0, the weaker the association between the variables.
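
To make the two extremes concrete, here is a minimal sketch using NumPy on made-up numbers (the arrays and variable names are purely illustrative). One series rises in lockstep with x and the other falls in lockstep, so the coefficients should land at the two ends of the range.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# y_up rises by a fixed amount for every step in x; y_down falls instead.
y_up = 2.0 * x + 1.0
y_down = -3.0 * x + 10.0

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the coefficient.
r_up = np.corrcoef(x, y_up)[0, 1]
r_down = np.corrcoef(x, y_down)[0, 1]

print(round(r_up, 6), round(r_down, 6))  # very close to 1 and -1
```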

Which type of Correlation is meant when we use simply the word Correlation?

Depending on the context, it can be any of the three types described below. Usually, though, when people simply say "correlation" they mean the Pearson correlation.

Typical use-case of Correlation

Typically, researchers specify (either explicitly or implicitly) two hypotheses in relation to a correlation analysis: a null hypothesis and an alternative hypothesis.

Null Hypothesis (H0): There is no association between the two variables of interest.

Alternative Hypothesis (H1): There is an association between two variables of interest.

Researchers conduct inferential statistics in order to determine whether they can reject the null hypothesis or not. If the statistical analysis is associated with a p < .05, they tend to reject the null hypothesis and suggest that there is evidence in favour of the alternative hypothesis.
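
As a rough illustration of that decision rule, the sketch below uses `scipy.stats.pearsonr`, which returns both the coefficient and a p-value for the null hypothesis of no association. The data are simulated and the 0.05 cut-off is simply the convention mentioned above, so treat this as an example under those assumptions and not a recipe.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical data: y depends on x plus noise, so an association exists.
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)

# pearsonr returns the coefficient and the two-sided p-value.
r, p_value = pearsonr(x, y)

alpha = 0.05
reject_null = p_value < alpha
print(f"r = {r:.3f}, p = {p_value:.3g}, reject H0: {reject_null}")
```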

We will talk about p-values in another blog post.

Types of Correlation

There are three types of Correlation:

  1. Pearson Correlation
  2. Spearman Rank Correlation
  3. Kendall's Tau Correlation

Pearson Correlation

It is a standardised index, also known as Pearson's r.

"It quantifies the degree of linear association between two variables which are assumed to be measured on an interval/ratio scale."

To find out about the different scales of measurement, check out this video.

Positive & Negative Pearson's Correlation

A positive correlation implies that as the numerical value of one variable increases, the numerical value of another variable also increases. For example, the correlation between height and weight is positive in nature: taller people tend to be heavier than shorter people. It’s a tendency, not a 1 to 1 association. By contrast, a negative correlation implies that as the numerical value of one variable increases, the numerical value of another variable decreases. For example, consider the outside temperature and the amount of clothes people wear: hotter days tend to be associated with fewer clothes worn.

Range for Pearson's Correlation

Theoretically, values of Pearson’s r can range from -1.0 to 1.0. Most correlations reported in scientific papers have an absolute value somewhere between .10 and .70 (that is, from .10 to .70, or from -.10 to -.70). It is rare to see a correlation larger than .70 in absolute value. A Pearson r value of .00 implies the total absence of a linear association between two variables. It is also relatively uncommon to observe a correlation of exactly .00.

Calculating Pearson Correlation

In order to calculate a Pearson correlation, you need to understand z-scores.

There are four steps to calculating a Pearson correlation:

1: Convert the raw scores into z-scores
2: Multiply each case's corresponding z-scores
3: Sum the multiplied z-scores across all cases
4: Divide the sum of the z-score products by N – 1

Here, N is the total number of instances in the dataset.
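
The four steps can be sketched in a few lines of NumPy. The function name and the height/weight numbers below are hypothetical. One assumption is made explicit: the z-scores use the sample standard deviation (`ddof=1`), which is what makes dividing by N – 1 in the last step give the usual Pearson r.

```python
import numpy as np

def pearson_from_z_scores(x, y):
    """Pearson r computed with the four z-score steps (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)

    # Step 1: convert raw scores to z-scores (sample standard deviation).
    zx = (x - x.mean()) / x.std(ddof=1)
    zy = (y - y.mean()) / y.std(ddof=1)

    # Steps 2 and 3: multiply each case's z-scores, then sum across cases.
    sum_of_products = np.sum(zx * zy)

    # Step 4: divide by N - 1.
    return sum_of_products / (n - 1)

# Made-up example data
height = [150, 160, 170, 180, 190]
weight = [52, 58, 69, 77, 90]

r = pearson_from_z_scores(height, weight)
print(round(r, 4))
```

The result should agree with `np.corrcoef(height, weight)[0, 1]`, which is a handy cross-check.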

Interpretation Of Pearson's Correlation

Cohen (1992) published some extremely popular guidelines for interpreting the magnitude of a Pearson correlation: |.10| = small; |.30| = medium, and |.50| = large. These values were suggested based on Cohen’s experience reading published scientific papers. Expressed as coefficients of determination, Cohen’s guidelines correspond to .01, .09, and .25; or 1%, 9% and 25% shared variance. Based on Cohen’s (1992) guidelines, the estimated correlation of .338 between years of education completed and earnings would be considered a medium sized correlation.
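
The arithmetic behind those guidelines is easy to check. The helper below is a hypothetical sketch: Cohen gave three reference points, and treating them as lower bounds of bands (and calling anything under .10 "below small") is an assumption made here for illustration only.

```python
def cohen_label(r):
    """Label the magnitude of r using Cohen's reference points as lower bounds."""
    size = abs(r)
    if size >= 0.50:
        return "large"
    if size >= 0.30:
        return "medium"
    if size >= 0.10:
        return "small"
    return "below small"  # label chosen for this sketch, not from Cohen

for value in (0.10, 0.30, 0.338, 0.50):
    shared_variance = value ** 2  # coefficient of determination
    print(f"r = {value:.3f} -> {cohen_label(value)}, r^2 = {shared_variance:.3f}")
```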

When to use Pearson's Correlation for feature selection in Machine Learning?

1- Strictly speaking, a Pearson correlation assumes the independent and dependent variables have been measured on a continuous scale.

The scale can be ratio, interval or even ordinal (as long as it has at least 5 points).

For instance, say we have a classification problem with numerical features. Should we calculate the Pearson correlation between the features and the target?

I believe not, because the target has a discrete scale while the features have a continuous scale. However, we can still use the Pearson correlation to check for collinearity between the continuous features.

2- Many sources state that the Pearson correlation assumes a linear association between the independent and dependent variables.

I would prefer to say that the Pearson correlation is limited in that it can only quantify linear associations. So, it’s a limitation, not an assumption, per se. However, it is always good practice to examine scatter plots to help identify non-linearity (a.k.a., curvilinearity) in the data.
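
A small made-up example shows the limitation. Below, y is completely determined by x, but the relationship is a symmetric curve and not a line, so Pearson's r comes out at essentially zero.

```python
import numpy as np

# x is symmetric around zero; y depends on x exactly, but not linearly.
x = np.linspace(-3, 3, 61)
y = x ** 2

r = np.corrcoef(x, y)[0, 1]
print(f"Pearson r for y = x^2: {abs(r):.3f}")
```

A scatter plot of these points would reveal the U-shape immediately, which is why plotting remains worthwhile.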

But sometimes we also use the Pearson correlation as a measure of correlation between two features.

3- Theoretically, the Pearson correlation assumes perfectly normally distributed data.

However, based on empirical simulation research, it has been discovered that the Pearson correlation is fairly robust to violations of the theoretical assumption of normality. Based on my reading of the simulation research (Bishara & Hittner, 2012; Edgell & Noon, 1984; Havlicek & Peterson, 1977), normal theory estimation (“regular” p-values) will provide respectably accurate p-values when the absolute skewness of the data is less than 2.0 and the absolute kurtosis is less than 9.0.

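If you are trying to find a correlation between features of a dataset that is not normally distributed, note that feature scaling (for example with the StandardScaler of the scikit-learn library) will not help: it only shifts and rescales each feature, so it changes neither the shape of the distribution nor the value of the Pearson correlation. For heavily non-normal data, a rank-based correlation such as Spearman's may be the safer choice.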
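
It may be useful to check what scaling actually does to the coefficient. The sketch below, on simulated skewed data, standardises two features by subtracting the mean and dividing by the standard deviation (the same per-column transform StandardScaler applies; plain NumPy is used here to keep the example dependency-free) and compares Pearson's r before and after. Because the transform is linear, the two values should agree.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical right-skewed features
x = rng.exponential(scale=2.0, size=500)
y = x + rng.exponential(scale=1.0, size=500)

def standardise(values):
    # Subtract the mean and divide by the (population) standard deviation.
    return (values - values.mean()) / values.std()

r_raw = np.corrcoef(x, y)[0, 1]
r_scaled = np.corrcoef(standardise(x), standardise(y))[0, 1]

print(f"raw: {r_raw:.6f}, standardised: {r_scaled:.6f}")
```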

Spearman Rank Correlation

The Spearman rank correlation (rS) is simply a Pearson correlation applied to ranked data. The data can be originally ranked by the participants in the investigation.

Alternatively, the originally non-ranked data provided by participants can be ranked by the researcher, after the data have been collected. As the Spearman correlation is based on ranked data, it does not assume normally distributed data. In fact, the Spearman correlation can handle any level of non-normality, unlike the Pearson correlation.
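
The "Pearson on ranks" description can be verified directly in pandas. In this sketch the column names and values are made up; the data are ranked with `DataFrame.rank()` and a Pearson correlation is computed on the ranks, then compared with pandas' own Spearman result.

```python
import pandas as pd

# Hypothetical raw (non-ranked) data
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "exam_score": [35, 40, 38, 60, 75, 99],
})

# Spearman as computed by pandas
spearman = df.corr(method="spearman").loc["hours_studied", "exam_score"]

# Pearson applied to the ranked columns
pearson_on_ranks = df.rank().corr(method="pearson").loc["hours_studied", "exam_score"]

print(round(spearman, 4), round(pearson_on_ranks, 4))
```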

In my opinion, the Spearman rank correlation is much too frequently applied, because researchers automatically turn to the Spearman correlation, when the data are perceived to be excessively skewed.

We might use the Spearman rank correlation for feature selection when some features represent ranked data and the dataset is not normally distributed.

Kendall's Tau Correlation

It is also used to measure the concordance (agreement) between two ranked columns, just like the Spearman rank correlation.

It also has a range of -1 to +1.
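
To give a feel for what "concordance" means, here is a minimal sketch of the simplest form of Kendall's tau, assuming no tied ranks. A pair of cases is concordant when both columns order them the same way and discordant when they disagree. The function and the two judges' rankings are hypothetical, and library implementations use a variant that also handles ties.

```python
from itertools import combinations

def kendall_tau_no_ties(x, y):
    """(concordant - discordant) / number of pairs, assuming no tied values."""
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        direction = (x2 - x1) * (y2 - y1)
        if direction > 0:
            concordant += 1   # both columns order this pair the same way
        elif direction < 0:
            discordant += 1   # the columns disagree on this pair
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Two judges ranking the same five items (made-up data)
judge_a = [1, 2, 3, 4, 5]
judge_b = [1, 3, 2, 5, 4]

tau = kendall_tau_no_ties(judge_a, judge_b)
print(tau)  # 8 concordant and 2 discordant pairs out of 10 -> 0.6
```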

For more information about Kendall's correlation, do watch these videos.

Calculating Correlation using Pandas

We can calculate all three types of correlation using Pandas with one line of code for each type.

You can find the exact documentation here.

Step 1- Reading your dataset

import pandas as pd
dataset = pd.read_csv('mydataset.csv')

Step 2- Calculating Pearson Correlation
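
The original snippet for this step is not available here, so what follows is a sketch of the likely one-liner: `DataFrame.corr` with `method="pearson"` (which is also the default). The small DataFrame is a hypothetical stand-in for the dataset loaded in Step 1.

```python
import pandas as pd

# Hypothetical stand-in for the dataset read in Step 1
dataset = pd.DataFrame({
    "height_cm": [150, 160, 170, 180, 190],
    "weight_kg": [52, 58, 69, 77, 90],
    "shoe_size": [36, 38, 41, 43, 46],
})

# Pairwise Pearson correlation between all columns
pearson_corr = dataset.corr(method="pearson")
print(pearson_corr)
```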


Step 3- Calculating Spearman Rank Correlation
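
Assuming the same `DataFrame.corr` call with a different `method` argument, a sketch for the rank correlation could look like this. The ranked columns are invented for illustration.

```python
import pandas as pd

# Hypothetical ranked features
dataset = pd.DataFrame({
    "satisfaction_rank": [1, 2, 3, 4, 5],
    "price_rank": [5, 4, 3, 1, 2],
})

# Pairwise Spearman rank correlation between all columns
spearman_corr = dataset.corr(method="spearman")
print(spearman_corr)
```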


Step 4- Calculating Kendall's Correlation
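
Again as a sketch with made-up data: passing `method="kendall"` to `DataFrame.corr` gives Kendall's tau. Note that pandas relies on SciPy for this method, so SciPy is assumed to be installed.

```python
import pandas as pd

# Hypothetical rankings from two judges
dataset = pd.DataFrame({
    "judge_a": [1, 2, 3, 4, 5],
    "judge_b": [1, 3, 2, 5, 4],
})

# Pairwise Kendall's tau between all columns (requires SciPy)
kendall_corr = dataset.corr(method="kendall")
print(kendall_corr)
```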


Step 5- Dropping features with high Correlation

You need to decide on a threshold above which features count as highly correlated; typically we consider 0.8. For the sake of this problem, however, we will drop one feature from each pair of features whose correlation is above 0.95, in order to remove the collinearity.

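# Feature reduction: drop one feature from each very highly correlated pair
import numpy as np

# Absolute Pearson correlations between the numeric columns only
corr_matrix = dataset.select_dtypes(include='number').corr().abs()

# Select upper triangle of correlation matrix (k=1 excludes the diagonal);
# the builtin bool replaces np.bool, which newer NumPy versions removed
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Find features with correlation greater than 0.95
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]

# Drop features
dataset = dataset.drop(to_drop, axis=1)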
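
To see the procedure on something small, here is the same idea wrapped in a hypothetical helper and run on a toy DataFrame in which one column is almost a rescaled copy of another. The function name, the default threshold and the data are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df, threshold=0.95):
    """Drop the later column of each pair whose |Pearson r| exceeds threshold."""
    corr_matrix = df.select_dtypes(include="number").corr().abs()
    # Keep only entries above the diagonal so each pair is considered once.
    mask = np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)
    upper = corr_matrix.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

toy = pd.DataFrame({
    "length_cm": [10.0, 12.0, 15.0, 20.0, 26.0],
    "length_in": [3.94, 4.72, 5.91, 7.87, 10.24],  # nearly a rescaled copy
    "weight_g": [5.0, 3.0, 9.0, 4.0, 7.0],
})

reduced, dropped = drop_highly_correlated(toy)
print(dropped)
print(list(reduced.columns))
```

Because only the upper triangle is inspected, the first column of a correlated pair is kept and the later one is dropped; reordering the columns changes which feature survives.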

Things to consider

1- Which type of correlation should I use?

You may want to use different correlations for different features. For instance, you may want to find the Pearson correlation between your continuous features and the Spearman rank correlation between your ranked features.

2- Make sure to remove unwanted features.

If you are using the Pearson correlation, make sure to drop categorical features. Whichever correlation type you use, make sure you are giving it the right type of data.

3- Consider the data distribution and the relationship between variables/features before choosing and applying a correlation type.

For instance, if your data are normally distributed, you are better off with the Pearson correlation; if they are not normally distributed, it is better to use the Spearman rank or Kendall's correlation.
