Imagine you are tasked with predicting the tourism revenue for a country, such as India. Your primary goal is to estimate the total revenue earned in a year (in USD). In this scenario, this revenue serves as your dependent variable, also known as the response variable. But what about the factors that influence this revenue—the independent or predictor variables?
Suppose you are provided with two different sets of predictor variables. Your job is to choose the set that will give the most accurate prediction of tourism revenue.
Understanding Predictor Variables
The first set of variables includes:
X1: Total number of tourists visiting the country
X2: Government spending on tourism marketing
X3: A linear combination of the first two variables, i.e., X3 = aX1 + bX2 + c, where a, b, and c are constants
The second set of variables includes:
X1: Total number of tourists visiting the country
X2: Government spending on tourism marketing
X3: Average currency exchange rate
Intuitively, which set do you think would provide more meaningful information for predicting tourism revenue? Most would agree that the second set is preferable. Here’s why: each variable in the second set provides unique information, and none of the variables is derived from another. In contrast, the first set contains a variable (X3) that is a linear combination of X1 and X2, which does not add any new information.
This situation illustrates the problem of multicollinearity. In the first set, the predictor variables are highly correlated with each other. Such correlation can obscure the true relationship between predictors and the response variable, leading to inaccurate or unreliable model results.
What is Multicollinearity?
According to Wikipedia, “Collinearity is a linear association between two explanatory variables. Two variables are perfectly collinear if there is an exact linear relationship between them. Multicollinearity refers to a situation in which two or more explanatory variables in a multiple regression model are highly linearly related.”
To put it simply, perfect multicollinearity occurs when one independent variable is an exact linear combination of other variables. For example, if you already have X and Y as predictors and add a new variable Z = aX + bY, Z does not provide any additional predictive power. Because the coefficients of X, Y, and Z can be traded off against one another without changing the fitted values, the model cannot estimate them uniquely; in practice, software such as R's lm() simply drops the redundant variable.
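To see this in R, here is a minimal sketch with simulated data (the variable names and coefficients are made up for illustration): when Z is an exact linear combination of X and Y, lm() flags the redundancy by reporting an NA coefficient for Z.
# Hypothetical data illustrating perfect multicollinearity:
# Z is an exact linear combination of X and Y.
set.seed(42)
X = rnorm(50)
Y = rnorm(50)
Z = 2 * X + 3 * Y + 1
outcome = 5 * X - 2 * Y + rnorm(50)
# lm() detects the redundancy and reports Z's coefficient as NA
# ("not defined because of singularities").
summary(lm(outcome ~ X + Y + Z))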
Multicollinearity can arise from several sources:
Inclusion of derived variables: As in our first set example, variables computed from other predictors can introduce multicollinearity.
Similar variables: Variables that capture almost the same information, such as age and experience in a workforce dataset, may be highly correlated.
Incorrect use of dummy variables: Using dummy variables incorrectly, especially when categories overlap or are linearly dependent, can lead to multicollinearity (see the short sketch of the dummy-variable trap after this list).
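As a concrete illustration (hypothetical data, not from the article), creating a dummy for every category while keeping the intercept reproduces the classic dummy-variable trap:
# Dummy-variable trap: one dummy per category plus an intercept means the
# dummies sum exactly to the intercept column.
set.seed(7)
season = factor(sample(c("spring", "summer", "winter"), 40, replace = TRUE))
dummies = model.matrix(~ season - 1)   # a dummy for every category, no reference level dropped
y = rnorm(40)
summary(lm(y ~ dummies))               # one dummy coefficient comes back NA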
Why Multicollinearity Matters
While multicollinearity does not always affect the overall fit of the model, it can severely impact individual variables and their interpretability. Some challenges include:
Difficulty identifying statistically significant variables
Inflated standard errors, leading to less precise coefficient estimates
Sensitivity to changes in the dataset, such as adding or removing variables
Wider confidence intervals, making it harder to reject the null hypothesis
In short, multicollinearity can make it unclear which variables truly drive the response variable, and it can distort the apparent impact of independent variables in your model.
Detecting Multicollinearity
Several techniques help identify multicollinearity in your dataset:
Pairwise Correlation Plot: A simple first step is to visualize correlations among variables. High correlations may indicate multicollinearity.
Regression Coefficient Sensitivity: Large changes in regression coefficients when adding or removing variables, or variations across different samples, suggest multicollinearity.
Variance Inflation Factor (VIF): VIF quantifies the severity of multicollinearity for each predictor:
VIF = 1 / (1 − R²), where R² is the coefficient of determination obtained by regressing that predictor on all the other predictors.
Typically, a VIF above 10 signals high multicollinearity, while a value below 4 is generally considered safe (a short sketch of computing VIF by hand appears after this list).
Farrar-Glauber Test: A more formal approach involves three tests:
Chi-square test: Detects presence of multicollinearity
F-test: Identifies which variables are collinear
t-test: Determines the pattern of multicollinearity
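To make the VIF formula concrete, the sketch below (with hypothetical predictors x1, x2, and x3) computes the VIF of x1 by hand from an auxiliary regression of x1 on the remaining predictors:
# Hypothetical predictors; x1 is deliberately correlated with x2 and x3.
set.seed(1)
x2 = rnorm(100)
x3 = rnorm(100)
x1 = 0.8 * x2 + 0.3 * x3 + rnorm(100, sd = 0.2)
aux_fit = lm(x1 ~ x2 + x3)        # auxiliary regression of x1 on the other predictors
r2 = summary(aux_fit)$r.squared   # R-squared of the auxiliary regression
1 / (1 - r2)                      # VIF of x1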
Implementing Multicollinearity Detection in R
To illustrate, we’ll use the CPS_85_Wages dataset, which contains a random sample of 534 individuals from the Current Population Survey (CPS), including their wages and other characteristics.
data1 = read.csv(file.choose(), header = TRUE)  # interactively select the CPS_85_Wages CSV
head(data1)  # preview the first few rows
str(data1)   # inspect variable types
Variables include Education, South, Sex, Experience, Union, Wage, Age, Race, Occupation, Sector, and Marr.
We can fit a linear regression model predicting wages:
fit_model1 = lm(log(Wage) ~ ., data = data1)
summary(fit_model1)
While the model may have a significant F-statistic, several individual variables—such as Education, Experience, Age, and Occupation—are not statistically significant, hinting at potential multicollinearity.
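As a quick supplementary check (an addition to the original walkthrough, assuming the car package is installed), the VIF of each predictor in the fitted model can be inspected directly:
library(car)
vif(fit_model1)   # values well above 10 point to problematic predictors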
Pairwise Correlation
library(corrplot)
cor1 = cor(data1)  # correlation matrix of all variables
corrplot.mixed(cor1, lower.col = "black", number.cex = 0.7)  # visualize pairwise correlations
The plot may reveal a high correlation between Experience and Age, indicating multicollinearity.
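To pull the strongly correlated pairs out of the matrix programmatically rather than reading them off the plot, one option (the 0.7 cutoff is an arbitrary choice, not from the article) is:
# List variable pairs with absolute correlation above 0.7.
idx = which(abs(cor1) > 0.7 & upper.tri(cor1), arr.ind = TRUE)
data.frame(var1 = rownames(cor1)[idx[, 1]],
           var2 = colnames(cor1)[idx[, 2]],
           corr = round(cor1[idx], 2))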
Farrar-Glauber Test
Using the mctest package:
install.packages('mctest')
library(mctest)
omcdiag(data1[,c(1:5,7:11)], data1$Wage)  # overall multicollinearity diagnostics (Wage, column 6, excluded from the predictors)
imcdiag(data1[,c(1:5,7:11)], data1$Wage)  # individual diagnostics, including a VIF for each predictor
The test confirms multicollinearity, particularly among Education, Experience, and Age, with very high VIF values. Partial correlation tests further support this observation.
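The partial correlation tests themselves are not shown here; one way to run them is with the ppcor package (a tooling assumption, not code from the original article):
# Partial correlations among the predictors (Wage, column 6, excluded),
# using the ppcor package as one possible implementation.
install.packages('ppcor')
library(ppcor)
pcor(data1[,c(1:5,7:11)], method = "pearson")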
Addressing Multicollinearity
There are multiple approaches to tackle multicollinearity:
Removing problematic variables: Drop variables with very high VIF (e.g., >10). In our case, either Age or Experience could be removed.
Ridge Regression: Adds a penalty to the coefficient estimates to reduce the impact of multicollinearity (a brief sketch with the glmnet package follows this list).
Principal Component Regression (PCR): Transforms correlated variables into uncorrelated components.
Partial Least Squares Regression (PLSR): Another method to handle highly correlated predictors while preserving predictive power.
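As referenced in the ridge regression item above, a minimal sketch with the glmnet package (a package choice assumed here, with alpha = 0 selecting the ridge penalty) might look like this:
# Ridge regression sketch: glmnet with alpha = 0 applies the ridge penalty.
install.packages('glmnet')
library(glmnet)
x = model.matrix(log(Wage) ~ . - 1, data = data1)   # predictor matrix without an intercept column
y = log(data1$Wage)
cv_ridge = cv.glmnet(x, y, alpha = 0)               # cross-validate the penalty strength
coef(cv_ridge, s = "lambda.min")                    # coefficients at the selected lambda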
After removing or adjusting variables, refit the model and check if the coefficients and statistical significance improve. Iterative refinement often yields a more reliable and interpretable model.
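For instance, dropping Age while keeping Experience (which of the two to drop is a judgment call) and refitting might look like this:
fit_model2 = lm(log(Wage) ~ . - Age, data = data1)  # refit without Age
summary(fit_model2)
car::vif(fit_model2)   # re-check the VIFs of the remaining predictors (assumes car is installed)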
Conclusion
Multicollinearity is a subtle but critical issue in regression analysis. Ignoring it can lead to misleading interpretations and unreliable predictions. By carefully diagnosing multicollinearity using correlation analysis, VIF, and formal statistical tests like Farrar-Glauber, and addressing it through variable selection or advanced regression techniques, analysts can build models that are both accurate and interpretable.
Next time you develop a regression model in R, pay close attention to the relationships among your predictors—your model’s reliability may depend on it.
This article was originally published on Perceptive Analytics.
In the United States, our mission is simple: to enable businesses to unlock value in data. For over 20 years, we've partnered with more than 100 clients, from Fortune 500 companies to mid-sized firms, helping them solve complex data analytics challenges. As a leading Tableau Contractor in San Francisco, Tableau Contractor in San Jose, and Tableau Contractor in Seattle, we turn raw data into strategic insights that drive better decisions.