Path Analysis in R


Have you ever tried building a model to predict a car’s mileage using different car attributes? If yes, you’ve likely faced a common question: Which factors should I include in my model?
A straightforward approach might be to take one parameter — say engine capacity — and build a simple regression model. While this might give you a rough estimate, it’s an oversimplification of a much more complex relationship. After all, a car’s mileage depends on multiple factors working together — horsepower, engine type, number of cylinders, weight, and even transmission type.
So, instead of relying on a single factor, let’s take a step further and build a model that incorporates multiple predictors to understand their combined influence on mileage.

From Simple Regression to Multiple Regression
In the first approach, where you consider only one predictor, you’re performing a simple linear regression.
In the second approach, you’re building a multiple linear regression model — one where several independent variables collectively explain the dependent variable (in this case, mileage).
At first glance, the second model seems better because it captures more real-world complexity. However, this introduces a new challenge: what if some of your independent variables are not truly independent?
For instance, mileage may depend on horsepower, but horsepower itself may depend on engine capacity and the number of cylinders. In such a case, the relationships among variables are interconnected.
This is where Path Analysis comes in.
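The interdependence described above can be checked directly before any path model is built. The following is a minimal sketch using the built-in mtcars data; the choice of hp, disp, and cyl as the columns to inspect is an illustrative assumption, not part of the original walkthrough.

```r
# Sketch: check how strongly the supposedly independent predictors
# relate to each other. mtcars ships with base R.
predictors <- mtcars[, c("hp", "disp", "cyl")]

# Pairwise correlations among the predictors
predictor_cor <- cor(predictors)
print(round(predictor_cor, 2))

# Regress one predictor on the others. A high R-squared means hp is
# largely explained by disp and cyl, so it is not truly independent.
hp_fit <- lm(hp ~ disp + cyl, data = mtcars)
hp_r2 <- summary(hp_fit)$r.squared
print(hp_r2)
```

If the correlations and the R-squared are high, a plain multiple regression treats as separate inputs what is really a chain of dependencies.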

What is Path Analysis?
Path Analysis is an extension of multiple regression that allows us to explore interdependent relationships among variables.
In simpler terms, it helps answer questions like:
“Does horsepower affect mileage directly, or does its effect pass through another variable, like weight?”
“Are there indirect effects between variables that influence the final outcome?”
Path analysis models these direct and indirect effects simultaneously, making it particularly useful for systems with multiple intermediate or dependent relationships.
In notation, it allows us to study cases where Z depends on Y, and Y depends on X — and we can quantify both the direct and indirect effects.
Earlier, this technique was referred to as “causal modeling.” However, statisticians later discouraged this term since statistical models cannot establish true causality. True causation can only be determined through controlled experimental designs. Path analysis can disprove causal assumptions but cannot prove them.
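To make the X, Y, Z notation concrete, here is a small sketch that simulates such a chain and recovers the direct and indirect effects with two ordinary regressions. The coefficient values (0.6, 0.8, 0.3), the sample size, and all variable names are assumptions chosen purely for illustration.

```r
set.seed(42)
n <- 500

# Assumed true coefficients for the illustration
a <- 0.6         # X -> Y
b <- 0.8         # Y -> Z
c_direct <- 0.3  # X -> Z (direct path)

X <- rnorm(n)
Y <- a * X + rnorm(n)
Z <- c_direct * X + b * Y + rnorm(n)

# One regression per endogenous variable
fit_y <- lm(Y ~ X)
fit_z <- lm(Z ~ X + Y)

a_hat <- coef(fit_y)[["X"]]       # estimated X -> Y path
b_hat <- coef(fit_z)[["Y"]]       # estimated Y -> Z path
direct_hat <- coef(fit_z)[["X"]]  # estimated direct X -> Z path

# Indirect effect of X on Z is the product of the paths along X -> Y -> Z
indirect_hat <- a_hat * b_hat
total_hat <- direct_hat + indirect_hat

# With least squares, direct + indirect equals the slope of Z on X alone
total_check <- coef(lm(Z ~ X))[["X"]]

print(c(direct = direct_hat, indirect = indirect_hat,
        total = total_hat, total_check = total_check))
```

The last two printed values agree, which shows how a single regression of Z on X hides the split between the direct path and the path that runs through Y.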

Key Terminologies in Path Analysis
Instead of the usual independent and dependent variables, path analysis introduces two new terms:
Exogenous Variables:
Variables that cause other variables but are not themselves influenced by any variables in the model. They have arrows starting from them, but none pointing toward them.
Endogenous Variables:
Variables that are influenced by other variables within the model. They have at least one arrow pointing toward them.
The logic behind this naming is simple — exogenous causes come from outside the system, while endogenous causes come from within the system.
In a typical path diagram:
X might be exogenous.
Y and Z might be endogenous.
A small “d” often represents disturbance terms, analogous to residuals in regression — unobserved factors influencing the model.
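The arrow rule above translates into a few lines of code. This sketch assumes a hypothetical edge list for a diagram with paths X to Y, Y to Z, and X to Z; the data frame and its column names (`from`, `to`) are made up for the example.

```r
# Hypothetical edge list: each row is one arrow in the path diagram
paths <- data.frame(
  from = c("X", "Y", "X"),
  to   = c("Y", "Z", "Z"),
  stringsAsFactors = FALSE
)

all_vars <- union(paths$from, paths$to)

# Endogenous: at least one arrow points toward the variable
endogenous <- unique(paths$to)

# Exogenous: no arrow points toward the variable
exogenous <- setdiff(all_vars, endogenous)

print(exogenous)
print(endogenous)
```

Note that Y is endogenous even though it also sends an arrow to Z: what matters is whether anything in the model points at it.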

Assumptions in Path Analysis
Because path analysis extends multiple regression, many of its assumptions are similar:
Linearity:
Relationships among all variables must be linear.
Continuous Endogenous Variables:
Endogenous variables should be continuous. For ordinal data, at least five categories are recommended.
No Interaction Among Variables:
Variables should not interact. If they do, include a separate term representing the interaction.
Uncorrelated Disturbances:
Covariances among disturbance terms should be zero — i.e., the error terms are uncorrelated.
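Two of these assumptions, linearity and the absence of interactions, can be probed with ordinary regressions before fitting a path model. The sketch below uses the built-in mtcars data; the particular variables tested (hp and wt as predictors of mpg) are an illustrative choice, and the tests are a rough screen rather than a formal diagnostic procedure.

```r
# Linearity: does adding a squared term improve on the straight-line fit?
linear_fit <- lm(mpg ~ hp, data = mtcars)
quad_fit <- lm(mpg ~ hp + I(hp^2), data = mtcars)
curvature_test <- anova(linear_fit, quad_fit)
p_curvature <- curvature_test[["Pr(>F)"]][2]

# Interaction: does the effect of hp on mpg depend on wt?
# hp * wt expands to hp + wt + hp:wt
interaction_fit <- lm(mpg ~ hp * wt, data = mtcars)
p_interaction <- summary(interaction_fit)$coefficients["hp:wt", "Pr(>|t|)"]

print(c(curvature = p_curvature, interaction = p_interaction))
```

A small p-value for the squared term suggests the relationship is not linear; a small p-value for hp:wt suggests an interaction that should be modeled as its own term.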
With the conceptual foundation in place, let’s move to implementation in R.

Implementing Path Analysis in R
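We’ll use the following packages for our analysis. lavaan fits the models, semPlot draws the path diagrams, and corrplot draws the correlation plot; OpenMx and GGally are loaded as well, although the code below never calls them directly.
# Install once per machine
install.packages(c("lavaan", "OpenMx", "semPlot", "GGally", "corrplot"))

library(lavaan)    # model fitting
library(OpenMx)
library(semPlot)   # path diagrams
library(GGally)
library(corrplot)  # correlation plot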

Creating a Custom Dataset
To build intuition, let’s start with a small dataset we create ourselves.
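set.seed(11)

# Path coefficients used to generate the data
a = 0.5
b = 5
c = 7
d = 2.5

# Three exogenous predictors
x1 = rnorm(20, mean = 0, sd = 1)
x2 = rnorm(20, mean = 0, sd = 1)
x3 = runif(20, min = 2, max = 5)

# Y depends on x1 and x2; Z depends on x3 and Y.
# The rnorm() noise terms play the role of disturbances. Without them, Y and Z
# are exact linear combinations of the predictors, the covariance matrix is
# singular, and the model cannot be estimated.
Y = a*x1 + b*x2 + rnorm(20)
Z = c*x3 + d*Y + rnorm(20)

data1 = data.frame(x1, x2, x3, Y, Z)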
head(data1, n = 10)

We’ve created three predictors (x1, x2, x3), one intermediate variable (Y), and one final outcome (Z).
Let’s visualize correlations among these variables.
cor1 = cor(data1)
corrplot(cor1, method = 'square')

You’ll observe:
Y is strongly correlated with x2.
Z is strongly correlated with Y and x2.
x1 shows a weaker relationship.

Building the Path Model
model1 = '
Z ~ x1 + x2 + x3 + Y
Y ~ x1 + x2
'

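# sem() is lavaan's wrapper for path and structural models; cfa() is aimed at
# confirmatory factor analysis, although it fits this model too
fit1 = sem(model1, data = data1)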
summary(fit1, fit.measures = TRUE, standardized = TRUE, rsquare = TRUE)

The summary output provides parameter estimates and R² values.
You can visualize the model using:
semPaths(fit1, 'std', layout = 'circle')

The diagram shows:
Z is strongly dependent on Y.
Y is primarily influenced by x2.
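The numbers on the arrows are path coefficients, similar to standardized beta coefficients in regression. The larger a coefficient's absolute value, the stronger the relationship; whether it is statistically significant is a separate question, answered by the p-values in the summary output.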
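The coefficients shown on the diagram can also be pulled out as a table, which makes it easier to rank paths by strength and read their p-values side by side. This is a minimal sketch on freshly simulated data: the sample size, the noise terms, and the object names (`sim`, `std_paths`) are assumptions for the example, and it uses lavaan's `sem()` and `standardizedSolution()`.

```r
library(lavaan)

set.seed(11)
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
# Noise is added so that Y and Z are not exact functions of the predictors
Y <- 0.5 * x1 + 5 * x2 + rnorm(n)
Z <- 2.5 * Y + rnorm(n, sd = 2)
sim <- data.frame(x1, x2, Y, Z)

model <- '
  Z ~ x1 + x2 + Y
  Y ~ x1 + x2
'
fit <- sem(model, data = sim)

# Keep only the regression paths (op == "~") from the standardized solution
std <- standardizedSolution(fit)
std_paths <- std[std$op == "~", c("lhs", "rhs", "est.std", "pvalue")]

# Rank paths by absolute standardized coefficient
std_paths <- std_paths[order(-abs(std_paths$est.std)), ]
print(std_paths)
```

Reading est.std and pvalue together keeps the two questions apart: how strong a path is, and how confident we can be that it differs from zero.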

Real-World Example: The mtcars Dataset
Now that we understand the basics, let’s apply path analysis to a real dataset — mtcars.
data2 = mtcars
head(data2, n = 10)

Let’s model miles per gallon (mpg) as a function of several car attributes and explore how horsepower (hp) itself depends on other features.
model2 = '
mpg ~ hp + gear + cyl + disp + carb + am + wt
hp ~ cyl + disp + carb
'

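# sem() is lavaan's wrapper for path and structural models.
# lavaan may warn that some observed variances are much larger than others,
# because disp and hp are measured on far larger scales than wt or am;
# rescaling those columns (for example disp / 100) is a common remedy.
fit2 = sem(model2, data = data2)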
summary(fit2)

From the results, we find:
Weight (wt) is a significant predictor of mpg.
Displacement (disp) and Carburetors (carb) significantly affect horsepower (hp).
Interestingly, hp is not a strong predictor of mpg once weight and the other attributes are in the model, suggesting that power alone doesn’t dictate fuel efficiency.
Visualize the relationships:
semPaths(fit2, 'std', 'est', curveAdjacent = TRUE, style = "lisrel")

The resulting diagram reinforces our earlier findings:
mpg is heavily influenced by wt.
hp is largely driven by disp and carb.
The link between hp and mpg is weak — consistent with the summary statistics.

Interpreting Path Analysis Results
Path analysis helps us quantify both direct and indirect relationships.
For instance, displacement indirectly influences mileage through its impact on horsepower.
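An indirect effect like this one can be estimated explicitly by labelling the paths and defining their product with lavaan's `:=` operator. The sketch below uses a deliberately reduced model rather than the full one fitted earlier; the labels (`a`, `b`, `c`), the rescaling of hp and disp, and the selection of predictors are all assumptions made for illustration.

```r
library(lavaan)

# Rescale hp and disp so all variances are on comparable scales
cars <- transform(mtcars, hp = hp / 100, disp = disp / 100)

model <- '
  # direct paths into mpg; b and c are labels, not variables
  mpg ~ b*hp + c*disp + wt
  # path from disp into the mediator hp
  hp ~ a*disp + carb
  # defined parameters: effect of disp on mpg through hp, and overall
  indirect := a*b
  total := c + a*b
'
fit <- sem(model, data = cars)

est <- parameterEstimates(fit)
effects <- est[est$op == ":=", c("lhs", "est", "se", "pvalue")]
print(effects)
```

The indirect row is the product of the disp to hp and hp to mpg coefficients, and lavaan reports a standard error for it alongside the total effect.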
However, one must remember:
Path analysis is not a discovery tool — it’s used to test a theoretical model you already believe in.
Adding or omitting variables can dramatically alter results.
It’s best used to compare models and see which structure fits the data better.
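Comparing candidate structures might look like the following sketch. Both models, their names, and the rescaling step are hypothetical choices for the example; model A assumes displacement affects mileage only through horsepower, while model B adds a direct path.

```r
library(lavaan)

# Rescale hp and disp so all variances are on comparable scales
cars <- transform(mtcars, hp = hp / 100, disp = disp / 100)

# Model A: disp reaches mpg only through hp
model_a <- '
  mpg ~ hp + wt
  hp ~ disp + carb
'
# Model B: same as A plus a direct disp -> mpg path
model_b <- '
  mpg ~ hp + wt + disp
  hp ~ disp + carb
'
fit_a <- sem(model_a, data = cars)
fit_b <- sem(model_b, data = cars)

# Side-by-side fit indices (lower chisq, AIC, and BIC indicate better fit)
measures <- c("chisq", "df", "aic", "bic", "cfi", "rmsea")
fit_table <- sapply(list(A = fit_a, B = fit_b),
                    function(f) unclass(fitMeasures(f, measures)))
print(round(fit_table, 3))

# Chi-square difference test, valid here because A is nested in B
lrt <- anova(fit_a, fit_b)
print(lrt)
```

Because the two models differ by a single path, the difference test speaks directly to whether that path is needed; for models that are not nested, AIC and BIC are the more appropriate comparison.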

Final Thoughts
Path analysis provides a powerful framework to explore interdependent systems — where variables influence each other directly and indirectly.
It bridges the gap between simple regression models and more complex Structural Equation Modeling (SEM) techniques. When used thoughtfully, it reveals the subtle interplay of factors driving outcomes in economics, psychology, marketing, and engineering.
However, remember that correlation is not causation. Use path analysis to test and refine your understanding of relationships, not to infer causality.