What is multicollinearity?
Multicollinearity occurs when two or more predictor variables in a regression model are highly correlated, leading to issues in estimating individual coefficients. This tangled web can introduce numerical instability, making it tricky to pinpoint the true relationship between predictors and the target variable.
But wait, how does it create an issue in estimating individual coefficients?
Why multicollinearity is bad
Suppose this is the equation of our regression model:

y = β0 + β1x1 + β2x2

Here y is the dependent variable and x1 and x2 are independent variables, where β1 and β2 represent the weights of x1 and x2 respectively. In other words, β1 tells us how much weight x1 carries in determining y. So if we want to see how much y depends upon x1, we can estimate β1 while keeping the other variables constant.

Now, if the two variables x1 and x2 are correlated, then we cannot isolate x1, because it moves together with x2, and hence we will have a problem estimating the value of β1.

Since our goal is to find out how the target depends on each feature individually, having multicollinearity in the data is very bad.
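To see this concretely, here is a minimal sketch using plain NumPy: we generate y from two almost identical predictors and fit the model on two random subsamples with ordinary least squares. Notice how the estimates of β1 and β2 swing between fits even though y comes from the same formula every time.

import numpy as np

np.random.seed(0)
n = 100
x1 = np.random.rand(n)
x2 = x1 + 0.01 * np.random.rand(n)                 # x2 is almost identical to x1
y = 2 * x1 + 3 * x2 + 0.1 * np.random.randn(n)     # true weights: 2 and 3

# Fit y = b0 + b1*x1 + b2*x2 on two random subsamples via least squares
for trial in range(2):
    idx = np.random.choice(n, size=80, replace=False)
    X = np.column_stack([np.ones(idx.size), x1[idx], x2[idx]])
    coeffs, *_ = np.linalg.lstsq(X, y[idx], rcond=None)
    print(f"trial {trial}: b1 = {coeffs[1]:.2f}, b2 = {coeffs[2]:.2f}")

The sum b1 + b2 stays roughly stable, but the individual coefficients are unreliable, which is exactly the problem described above.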
How to remove multicollinearity
We can apply feature selection, where we identify which columns are important and drop the rest.
We can apply a regularized regression technique, such as ridge or lasso, which introduces a penalty term that shrinks the coefficients (a short sketch follows this list).
We can calculate the VIF (Variance Inflation Factor). Variables with high VIF values (typically above 5 or 10) may indicate multicollinearity (a VIF sketch also follows this list).
We can plot a scatter plot and, by looking at the graph, tell which features are correlated.
We can increase the size of the dataset. Sometimes more data provides a more representative sample of the population, leading to more stable coefficient estimates.
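As an illustration of the penalty idea, here is a minimal sketch assuming scikit-learn is installed (it is not used elsewhere in this post). Ridge regression adds an L2 penalty that keeps the coefficients of correlated predictors from blowing up in opposite directions.

import numpy as np
from sklearn.linear_model import Ridge

np.random.seed(42)
x1 = np.random.rand(100)
x2 = 0.8 * x1 + 0.2 * np.random.rand(100)        # correlated with x1
y = 2 * x1 + 3 * x2 + 0.1 * np.random.randn(100)

X = np.column_stack([x1, x2])
ridge = Ridge(alpha=1.0).fit(X, y)               # alpha controls the penalty strength
print(ridge.coef_)                               # shrunken, more stable coefficients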
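And here is a minimal sketch of the VIF check, assuming statsmodels is installed; it can be run on the data DataFrame that we build in the code section below.

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(df):
    # one VIF value per column of the DataFrame
    return pd.DataFrame({
        "feature": df.columns,
        "VIF": [variance_inflation_factor(df.values, i) for i in range(df.shape[1])],
    })

# vif_table(data)  # features with a VIF above 5-10 are candidates to drop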
Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(42)
# These are correlated data
x1 = np.random.rand(100)
x2 = 0.8 * x1 + 0.2 * np.random.rand(100)
x3 = 0.7 * x1 + 0.3 * np.random.rand(100)
x4 = 0.6 * x1 + 0.4 * np.random.rand(100)
x5 = 0.5 * x1 + 0.5 * np.random.rand(100)
# These are uncorrelated data
x6 = np.random.rand(100)
x7 = np.random.rand(100)
x8 = np.random.rand(100)
x9 = np.random.rand(100)
x10 = np.random.rand(100)
Let's see how the graph looks if we plot two correlated features:
plt.scatter(x1, x2)
plt.show()
And if we plot two uncorrelated features:
plt.scatter(x6, x7)
plt.show()
From the figures, we can see which columns are correlated, and if we want to drop the correlated columns we can easily do that.
Now let's see what a correlation matrix is going to tell us
# First we need to put our variables into a data frame
data = pd.DataFrame({
    "x1": x1,
    "x2": x2,
    "x3": x3,
    "x4": x4,
    "x5": x5,
    "x6": x6,
    "x7": x7,
    "x8": x8,
    "x9": x9,
    "x10": x10,
})
matrix = data.corr()
matrix
Here, a high value represents a strong correlation. For example, x1 and x2 have a correlation of 0.97, which is very high. A negative value represents a negative correlation.
We can also visualize these correlations using a heatmap from seaborn.
import seaborn as sns
plt.figure(figsize=(10, 7))
sns.heatmap(matrix, annot=True, cmap='coolwarm')
plt.title("Correlation Matrix")
plt.show()
If the data is very large, it will be difficult to tell which columns are correlated just by looking. So let's build a function that takes a threshold (the amount of correlation) as a parameter and returns the names of the correlated columns.
def correlation(data, threshold):
    # use a set so that we don't collect duplicate column names
    correlated_col = set()
    correlated_matrix = data.corr()
    for i in range(len(correlated_matrix.columns)):
        for j in range(i):
            # if the correlation is greater than the threshold
            if abs(correlated_matrix.iloc[i, j]) > threshold:
                related_col_name = correlated_matrix.columns[i]
                correlated_col.add(related_col_name)
    return correlated_col
correlation(data, 0.9)
Output
{'x2', 'x3'}
and
correlation(data, 0.7)
Output
{'x2', 'x3', 'x4', 'x5'}
Now if we want to drop the correlated columns, we can easily do that.
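For example, here is a minimal sketch that feeds the helper's output into pandas' drop (the 0.7 threshold is a judgment call):

to_drop = correlation(data, 0.7)
reduced_data = data.drop(columns=list(to_drop))   # keeps x1 and the uncorrelated columns
reduced_data.columns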
Conclusion
By understanding and addressing multicollinearity, we can enhance the robustness and interpretability of our models, ensuring more accurate insights for data-driven decision-making.