Anubhav Shukla

Unraveling Multicollinearity: Causes, Implications, and Solutions

What is multicollinearity?

Multicollinearity occurs when two or more predictor variables in a regression model are highly correlated, leading to issues in estimating individual coefficients. This tangled web can introduce numerical instability, making it tricky to pinpoint the true relationship between predictors and the target variable.

But wait, how does it create an issue in estimating individual coefficients?

Why multicollinearity is bad

Suppose this is the equation of our regression model:

y = β0 + β1·x1 + β2·x2 + ε

Here y is the dependent variable, and x1 and x2 are the independent variables. β1 and β2 are the weights of x1 and x2 respectively, i.e. how much influence each one has in determining y. So if we want to see how much y depends upon x1, we can estimate β1 while holding the other variables constant.

Now, if the two variables x1 and x2 are correlated, we cannot isolate the effect of x1 because it moves together with x2, and hence we have a problem estimating the value of β1.

Since our goal is precisely to find out how the target depends on each individual feature, having multicollinearity in the data is very bad.
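To make this concrete, here is a minimal sketch (not from the original post; it assumes NumPy and scikit-learn are available, and the variable names are local to this sketch) that refits the same model on a few bootstrap resamples. Because the second feature is almost a copy of the first, the individual coefficients vary noticeably from fit to fit even though their sum stays near 5:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
a = rng.random(200)
b = a + 0.01 * rng.random(200)        # b is almost identical to a
y = 3 * a + 2 * b + 0.1 * rng.random(200)
X = np.column_stack([a, b])

# Refit on bootstrap resamples and watch the individual coefficients move around
for _ in range(3):
    idx = rng.integers(0, len(y), len(y))
    model = LinearRegression().fit(X[idx], y[idx])
    print(model.coef_)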

How to remove multicollinearity

  1. We can apply feature selection, where we identify which columns are important and drop the rest.

  2. We can apply a regularized regression technique such as Ridge or Lasso, which introduces a penalty term that shrinks the coefficients.

  3. Calculate the VIF (Variance Inflation Factor). Variables with high VIF values (typically above 5 or 10) may indicate multicollinearity; see the sketch after this list.

  4. We can draw a scatter plot and, by looking at the graph, tell which features are correlated.

  5. We can increase the size of the dataset. A larger, more representative sample of the population sometimes alleviates multicollinearity and leads to more stable coefficient estimates.
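As a quick, hedged illustration of point 3 (it assumes statsmodels is installed; vif_table is just an illustrative helper name, and data refers to the DataFrame built later in this post):

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(df):
    # Return the VIF of every column of df (an intercept column is added first)
    X = sm.add_constant(df)
    vifs = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    return pd.DataFrame({"feature": X.columns, "VIF": vifs})

# vif_table(data)  # ignore the 'const' row; features with VIF well above 5-10 stand out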

Code

# Imports used in the examples below
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(42)

# x2-x5 are built from x1, so they are strongly correlated with it
x1 = np.random.rand(100)
x2 = 0.8 * x1 + 0.2 * np.random.rand(100)
x3 = 0.7 * x1 + 0.3 * np.random.rand(100)
x4 = 0.6 * x1 + 0.4 * np.random.rand(100)
x5 = 0.5 * x1 + 0.5 * np.random.rand(100)

# x6-x10 are generated independently, so they are uncorrelated
x6 = np.random.rand(100)
x7 = np.random.rand(100)
x8 = np.random.rand(100)
x9 = np.random.rand(100)
x10 = np.random.rand(100)

Let's see what the plot looks like when we graph two correlated variables:

plt.scatter(x1, x2)

correlated graph

And if we plot the graph for two uncorrelated variables:

plt.scatter(x6, x7)

Uncorrelated graph

From the figures, we can see which columns are correlated, and if we want to drop the correlated columns we can easily do that.

Now let's see what a correlation matrix is going to tell us

# First we need to put our variables into a DataFrame
data = pd.DataFrame({
    "x1": x1,
    "x2": x2,
    "x3": x3,
    "x4": x4,
    "x5": x5,
    "x6": x6,
    "x7": x7,
    "x8": x8,
    "x9": x9,
    "x10":x10
})

matrix = data.corr()
matrix

Correlation matrix of all ten features

Here, a high value represents a strong correlation. For example, x1 and x2 have a correlation of about 0.97, which is very high. A negative value represents a negative correlation.

We can also visualize this correlation using the heatmap provided by seaborn.

import seaborn as sns
plt.figure(figsize=(10, 7))
sns.heatmap(matrix, annot=True, cmap='coolwarm')
plt.title("Correlation Matrix")
plt.show()

Heatmap

If the dataset is very large, it will be difficult to tell which columns are correlated just by looking at plots. So let's build a function that takes a threshold (the amount of correlation) as a parameter and returns the names of the correlated columns.

def correlation(data, threshold):

    # use a set so we don't collect duplicate column names
    correlated_col = set()

    correlated_matrix = data.corr()

    # walk the lower triangle of the matrix (j < i) so each pair is checked once
    for i in range(len(correlated_matrix.columns)):
        for j in range(i):

            # if the absolute correlation is greater than the threshold
            if abs(correlated_matrix.iloc[i, j]) > threshold:
                related_col_name = correlated_matrix.columns[i]
                correlated_col.add(related_col_name)

    return correlated_col

correlation(data, 0.9)

Output: {'x2', 'x3'}

and

correlation(data, 0.7)

Output: {'x2', 'x3', 'x4', 'x5'}

Now if we want to drop the correlated columns, we can easily do that.
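For example, a minimal sketch that feeds the function's output straight into pandas' drop (the to_drop and reduced_data names are just for illustration):

to_drop = correlation(data, 0.7)                 # {'x2', 'x3', 'x4', 'x5'}
reduced_data = data.drop(columns=list(to_drop))
reduced_data.columns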

Conclusion

By understanding and addressing multicollinearity, we can enhance the robustness and interpretability of our models, ensuring more accurate insights for data-driven decision-making.
