Introduction
Statistical modeling is at the heart of data science and applied research. For many analysts, linear regression is the first tool used to describe the relationship between a dependent variable and one or more independent variables. While linear regression works well when the dependent variable is normally distributed, real-world data often does not follow such neat assumptions. Sales figures, customer churn, disease progression, and click-through rates rarely fit into a simple bell-shaped curve.
This is where Generalized Linear Models (GLMs) come into play. GLMs extend linear regression to handle a wide variety of distributions and link functions, making them highly versatile. They allow analysts to model dependent variables that are not normally distributed—whether that’s count data, proportions, or categorical outcomes.
In this article, we will:
Review simple linear regression as a basic GLM.
Explore log-linear regression and understand why log-transformations are powerful.
Implement binary logistic regression in R for classification problems.
By the end, you will not only understand the theoretical foundation of GLMs but also learn to implement them using R’s built-in functions.
Revisiting the Basics: Linear Regression as a GLM
Linear regression is the simplest form of a GLM. It assumes that:
The dependent variable is normally distributed around the regression line (that is, the errors are normal).
The relationship between predictors and the outcome is linear.
The standard linear regression model is expressed as:
Yᵢ = α + βXᵢ
where:
Yᵢ is the dependent variable,
Xᵢ is the independent variable,
α is the intercept, and
β is the slope.
The coefficients are estimated using the Ordinary Least Squares (OLS) method. However, when the dependent variable does not behave normally—for example, when it’s always positive or represents counts—linear regression may produce nonsensical results such as negative predictions.
Example: Coca-Cola Sales vs. Temperature
Consider a dataset with two variables: temperature and Coca-Cola sales on a university campus. As expected, cola sales increase with temperature, but the relationship is not linear—it’s exponential.
In R, we can visualize and model this:
# Read data
data <- read.csv("Cola.csv", header = TRUE)
plot(data, main = "Scatter Plot: Temperature vs. Cola Sales")

# Fit linear model
model <- lm(Cola ~ Temperature, data)
abline(model)

# Calculate RMSE
library(hydroGOF)
PredCola <- predict(model, data)
rmse(PredCola, data$Cola)
The linear model gives a high RMSE (~241.49) and even predicts negative cola sales at low temperatures—an unrealistic outcome. This illustrates why linear regression is not always the best option.
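If you don't have the Cola.csv file at hand, you can simulate a dataset with a similar exponential pattern to follow along. This is a minimal sketch with invented coefficients and noise, so your exact RMSE values will differ from the ones quoted in this article:

# Hypothetical stand-in for Cola.csv: sales grow exponentially with temperature
set.seed(42)
Temperature <- runif(100, min = 5, max = 40)                  # degrees Celsius
Cola <- exp(2 + 0.15 * Temperature + rnorm(100, sd = 0.2))    # multiplicative noise
data <- data.frame(Temperature, Cola)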
Log-Linear Regression: Handling Exponential Growth
When the dependent variable grows (or decays) exponentially with the independent variable, log-linear regression is a better choice.
For instance:
Compound interest grows exponentially with time.
Salaries tend to grow exponentially with years of education.
Sales often increase exponentially with temperature or advertising spend.
Mathematically, such a relationship is expressed as:
Y = a⋅b^X
Taking logarithms on both sides:
log(Y) = log(a) + log(b)⋅X
Now the relationship becomes linear, allowing us to apply OLS.
Example in R: Log-Linear Model for Cola Sales
# Transform dependent variable
data$LCola <- log(data$Cola)

# Fit log-linear model and plot it on the log scale
model1 <- lm(LCola ~ Temperature, data)
plot(data$Temperature, data$LCola,
     main = "Log(Cola Sales) vs. Temperature",
     xlab = "Temperature", ylab = "log(Cola)")
abline(model1)

# Calculate RMSE on the log scale
PredCola1 <- predict(model1, data)
rmse(PredCola1, data$LCola)
The RMSE on the log scale is only ~0.24. Be careful when comparing this to the ~241.49 from the linear model: the two values are in different units (log-sales vs. sales), so the numbers are not directly comparable. What the log-linear model clearly fixes is the sign problem: once the predictions are back-transformed with exp(), as shown below, they can never be negative. This simple transformation turns an unrealistic model into a far more sensible one.
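To compare the two models in the same units, back-transform the log-scale predictions to the original sales scale and recompute the RMSE there. A minimal sketch, reusing the model1 and data objects from above:

# Back-transform predictions from the log scale to the sales scale
PredColaOrig <- exp(predict(model1, data))

# RMSE in sales units, directly comparable to the linear model's ~241.49
rmse(PredColaOrig, data$Cola)

# exp() is strictly positive, so predicted sales can never be negative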
Interpreting Log Transformations
Log transformations are widely used in regression because they:
Stabilize variance in skewed data.
Convert exponential growth into linear growth.
Handle multiplicative relationships between variables.
There are three main types of log-based models:
Log-linear regression – dependent variable is log-transformed.
Linear-log regression – independent variables are log-transformed.
Log-log regression – both dependent and independent variables are log-transformed.
Each has different interpretations:
In log-linear models, a one-unit change in a predictor changes the dependent variable by approximately 100·β percent (exactly (e^β − 1)·100%).
In linear-log models, a 1% change in a predictor changes the dependent variable by approximately β/100 units.
In log-log models, coefficients are elasticities: the percentage change in Y for a 1% change in X.
These interpretations make log transformations especially valuable in fields like economics, marketing, and finance.
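In R, the three variants differ only in which side of the formula is wrapped in log(). A sketch using the cola data from earlier; the linear-log and log-log fits are included purely to illustrate the syntax, not because they suit this particular dataset:

# Log-linear: log(Y) ~ X (model1 from above)
# A one-degree increase multiplies sales by exp(beta)
b <- coef(model1)["Temperature"]
(exp(b) - 1) * 100   # percentage change in sales per extra degree

# Linear-log: Y ~ log(X)
model_linlog <- lm(Cola ~ log(Temperature), data)

# Log-log: log(Y) ~ log(X); the slope is the elasticity of sales w.r.t. temperature
model_loglog <- lm(log(Cola) ~ log(Temperature), data)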
Binary Logistic Regression: Modeling Categorical Outcomes
Sometimes the dependent variable is not continuous at all, but categorical—often taking only two values (e.g., yes/no, success/failure, purchase/no purchase). In such cases, linear regression fails because it can predict probabilities outside the range of 0–1.
Binary logistic regression solves this by modeling the probability of an event occurring using the logistic (sigmoid) function:
P(Y = 1 | X) = 1 / (1 + e^(−(α + βX)))
This ensures probabilities always lie between 0 and 1.
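You can see this S-shaped behavior directly by plotting the logistic function itself; a quick sketch:

# The logistic (sigmoid) function maps any real number into (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))
curve(sigmoid(x), from = -6, to = 6,
      xlab = "alpha + beta * X", ylab = "P(Y = 1 | X)",
      main = "The Logistic Function")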
Example: Football Penalty Success
Suppose we want to model whether a football player scores a penalty (1) or misses (0), based on hours of practice. The data includes binary outcomes and practice hours.
# Read data
data1 <- read.csv("Penalty.csv", header = TRUE)
plot(data1, main = "Penalty Data: Practice Hours vs. Outcome")

# Fit logistic regression
fit <- glm(Outcome ~ Practice, family = binomial(link = "logit"), data = data1)

# Visualize predictions
curve(predict(fit, data.frame(Practice = x), type = "response"), add = TRUE)
points(data1$Practice, fitted(fit), pch = 20)
The model outputs probabilities instead of absolute predictions. As practice hours increase, the probability of scoring increases, which matches real-world expectations. Logistic regression thus provides a powerful way to handle binary outcomes.
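Once the model is fit, predict() with type = "response" returns probabilities for new observations, and the coefficient table can be read on the odds scale. A short sketch; the value of 10 practice hours is just an illustration, so pick something within the range of your own data:

# Predicted probability of scoring for a player with 10 hours of practice
predict(fit, newdata = data.frame(Practice = 10), type = "response")

# Coefficients are on the log-odds scale; exponentiating gives odds ratios
summary(fit)
exp(coef(fit)["Practice"])   # multiplicative change in odds per extra hour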
Beyond Binary: Extensions of Logistic Regression
While binary logistic regression models yes/no outcomes, real-world data often requires more complex versions:
Multinomial Logistic Regression – when there are more than two categories (e.g., predicting which coupon a customer redeems: A, B, or C).
Ordinal Logistic Regression – when categories have a natural order (e.g., customer satisfaction: poor, fair, good, excellent).
These extensions further expand the applicability of GLMs in business analytics, healthcare, and social sciences.
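Both extensions are available in packages that ship with R: nnet::multinom() for unordered categories and MASS::polr() for ordered ones. A minimal sketch on invented data; the variable names and values below are placeholders, not from a real dataset:

library(nnet)   # multinom() for unordered outcomes
library(MASS)   # polr() for ordered outcomes
set.seed(7)

# Hypothetical coupon-choice data
customers <- data.frame(
  Coupon = factor(sample(c("A", "B", "C"), 200, replace = TRUE)),
  Income = rnorm(200, mean = 50, sd = 10)
)
fit_multi <- multinom(Coupon ~ Income, data = customers)

# Hypothetical ordered satisfaction ratings
surveys <- data.frame(
  Satisfaction = factor(sample(c("poor", "fair", "good", "excellent"), 200, replace = TRUE),
                        levels = c("poor", "fair", "good", "excellent"), ordered = TRUE),
  WaitTime = runif(200, min = 1, max = 30)
)
fit_ord <- polr(Satisfaction ~ WaitTime, data = surveys, Hess = TRUE)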
Why GLMs Are So Powerful
The real strength of GLMs lies in their flexibility. They allow us to:
Model non-normal dependent variables (counts, binary, skewed data).
Use link functions (log, logit, probit, etc.) to connect the mean of the response to a linear combination of predictors.
Improve interpretability by transforming nonlinear growth patterns into linear trends.
Whether predicting cola sales, interest growth, or football performance, GLMs provide statistically sound and interpretable models.
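As a quick reference, here is how some common outcome types map onto R's glm() interface. The data below is simulated only so the calls run; in practice you would substitute your own variables:

# Tiny simulated dataset so each call is runnable
set.seed(1)
df <- data.frame(x = rnorm(50))
df$y_cont  <- 2 + 3 * df$x + rnorm(50)                               # continuous outcome
df$y_count <- rpois(50, lambda = exp(0.5 + 0.3 * df$x))              # count outcome
df$y_bin   <- rbinom(50, size = 1, prob = plogis(0.5 + 1.2 * df$x))  # binary outcome

glm(y_cont  ~ x, family = gaussian(link = "identity"), data = df)  # equivalent to lm()
glm(y_count ~ x, family = poisson(link = "log"),       data = df)  # count data
glm(y_bin   ~ x, family = binomial(link = "logit"),    data = df)  # logistic regression
glm(y_bin   ~ x, family = binomial(link = "probit"),   data = df)  # probit alternative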
Conclusion
In this article, we explored how Generalized Linear Models (GLMs) extend the traditional linear regression framework to handle a broader range of data distributions and relationships.
We revisited linear regression as the basic GLM and observed its limitations with non-normal data.
We applied log-linear regression to handle exponential growth, showing how a simple transformation can greatly improve model accuracy.
We implemented binary logistic regression in R to model categorical outcomes, emphasizing its usefulness in probability estimation.
GLMs are among the most practical and widely used tools in data science. With R’s built-in support (glm() function), implementing them is straightforward. Whether you are working with sales forecasting, risk modeling, medical data, or classification problems, GLMs provide a solid statistical foundation for accurate and meaningful predictions.
This article was originally published on Perceptive Analytics.
In the United States, our mission is simple: to enable businesses to unlock value in data. For over 20 years, we've partnered with more than 100 clients, from Fortune 500 companies to mid-sized firms, helping them solve complex data analytics challenges. As a leading Excel specialist, we turn raw data into strategic insights that drive better decisions.