Dipti Moryani

Implementing Log-linear Regression in R

In this article, we discuss widely used Generalized Linear Models (GLMs) in the industry, focusing on:
Log-linear regression
Interpreting log-transformations
Binary logistic regression
We also review the underlying distributions and applicable link functions. To illustrate concepts in R, we use the sample datasets:
Cola.csv
Penalty.csv

Introduction to Generalized Linear Models
A Generalized Linear Model (GLM) expresses the dependent variable as a function of independent variables. The simplest form is ordinary linear regression, which assumes that the dependent variable is normally distributed.
However, in real-world scenarios this assumption is often violated. For example, if the dependent variable is the number of coffee cups sold (always non-negative and typically right-skewed) and the independent variable is temperature, simple linear regression can produce nonsensical results, such as predicting negative sales at low temperatures.
GLMs overcome these limitations by connecting the mean of the dependent variable to the linear predictor through a link function, allowing the model to accommodate distributions other than the normal (e.g., Poisson, Bernoulli).
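
For instance, count data such as daily coffee-cup sales can be handled with a Poisson GLM and a log link, which guarantees positive predictions. A minimal sketch (the coffee data frame here is invented for illustration and is not one of our sample datasets):

# Hypothetical count data: daily coffee cups sold vs. temperature
coffee = data.frame(Temperature = c(5, 10, 15, 20, 25, 30),
                    Cups = c(40, 32, 25, 18, 12, 8))

# Poisson GLM with a log link
pois_fit = glm(Cups ~ Temperature, family = poisson(link = "log"), data = coffee)

# Predictions on the response scale stay positive, even at extreme temperatures
predict(pois_fit, data.frame(Temperature = 0), type = "response")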

Linear Regression in R
Simple linear regression models a linear relationship:
Yᵢ = α + βXᵢ
The coefficients are computed using Ordinary Least Squares (OLS).
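
For simple linear regression, OLS has a closed form: the slope is cov(X, Y) / var(X) and the intercept is mean(Y) − slope · mean(X). A quick sketch verifying this against lm() (the toy vectors are made up for illustration):

# Toy data for illustration
x = c(1, 2, 3, 4, 5)
y = c(2.1, 3.9, 6.2, 8.1, 9.8)

# Closed-form OLS estimates
beta_hat = cov(x, y) / var(x)
alpha_hat = mean(y) - beta_hat * mean(x)
c(alpha_hat, beta_hat)

# Should match the coefficients reported by lm()
coef(lm(y ~ x))
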
Example: Cola Sales vs Temperature

# Read the data
data = read.csv("Cola.csv", header = TRUE)
head(data)

# Scatter plot of sales vs. temperature
plot(data, main = "Scatter Plot")

# Install (once) and load hydroGOF for RMSE
install.packages("hydroGOF")
library(hydroGOF)

# Fit the linear model
model = lm(Cola ~ Temperature, data)

# Overlay the best-fit line
abline(model)

# Calculate RMSE
PredCola = predict(model, data)
RMSE = rmse(PredCola, data$Cola)

The simple linear regression fit is poor, with an RMSE of 241.49, and the model predicts negative sales at low temperatures, highlighting the limitations of linear regression when the dependent variable does not follow a normal distribution.
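
You can confirm the negative predictions directly from the fitted values computed above:

# The smallest fitted value falls below zero at low temperatures
min(PredCola)
summary(PredCola)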

Log-linear Regression
Log-linear regression is useful when the dependent variable grows exponentially with the independent variable. For instance, expected salary vs. education or compound interest over time.
The model form is:
Y = a · b^X
Taking the logarithm of both sides:
log(Y) = log(a) + log(b) · X
Now the relationship is linear in X and can be estimated using OLS. Here, log(a) is the intercept and log(b) is the slope, which represents the growth rate.
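
As a quick numeric illustration (the slope value 0.05 is hypothetical, not estimated from our data): a fitted slope of log(b) = 0.05 gives b = exp(0.05) ≈ 1.051, meaning Y grows by about 5.1% for each unit increase in X.

# Hypothetical slope on the log scale (illustration only)
log_b = 0.05
exp(log_b)   # ~1.0513: Y is multiplied by this factor per unit of X
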
Implementing Log-linear Regression in R

# Transform the dependent variable
data$LCola = log(data$Cola)

# Scatter plot of log-sales vs. temperature
plot(LCola ~ Temperature, data = data, main = "Scatter Plot")

# Fit a linear model on the transformed data
model1 = lm(LCola ~ Temperature, data)
abline(model1)

# Calculate RMSE (note: on the log scale)
PredCola1 = predict(model1, data)
RMSE = rmse(PredCola1, data$LCola)

The RMSE drops to 0.24 and predictions are now meaningful (no negative sales). Keep in mind that this RMSE is measured on the log scale, so it is not directly comparable to the 241.49 from the untransformed model.
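
To report predictions in the original units, back-transform with exp(); the fitted coefficients also recover the multiplicative form Y = a · b^X. A sketch continuing from model1:

# Back-transform fitted values to the original sales scale
PredColaOrig = exp(predict(model1, data))
min(PredColaOrig)              # always positive: no negative sales
rmse(PredColaOrig, data$Cola)  # error in original units

# Recover a and b from the fitted coefficients
a = exp(coef(model1)[1])
b = exp(coef(model1)[2])   # growth factor per degree of temperature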

Interpreting Log Transformations
Log transformations help model non-linear relationships using linear techniques.
Log-linear: Log-transform dependent variable; independent variable remains unchanged.
Linear-log: Log-transform independent variable; dependent variable remains unchanged.
Log-log: Log-transform both dependent and independent variables.
| Model type | Equation | Interpretation |
| --- | --- | --- |
| Log-linear | log(Y) = α + βX | A one-unit increase in X changes Y by about 100·β% |
| Linear-log | Y = α + β·log(X) | A 1% increase in X changes Y by about β/100 units |
| Log-log | log(Y) = α + β·log(X) | A 1% increase in X changes Y by about β% |
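
For completeness, all three variants can be fit on the Cola data with lm() (linear-log and log-log require Temperature > 0; this is an illustrative sketch, not part of the original analysis):

# Log-linear: log-transform Y only (same as model1 above)
log_linear = lm(log(Cola) ~ Temperature, data)

# Linear-log: log-transform X only
linear_log = lm(Cola ~ log(Temperature), data)

# Log-log: log-transform both sides
log_log = lm(log(Cola) ~ log(Temperature), data)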

Binary Logistic Regression
Binary logistic regression is used when the dependent variable is categorical (0 or 1). The conditional distribution is Bernoulli.
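
Under the logit link, the model writes the success probability p = P(Y = 1 | X) as

p = 1 / (1 + e^(−(β₀ + β₁X)))

or, equivalently, the log-odds are linear in X:

log(p / (1 − p)) = β₀ + β₁X
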
Example: Predicting success in a football penalty based on practice hours.

# Read the data
data1 = read.csv("Penalty.csv", header = TRUE)
head(data1)

# Scatter plot of outcome vs. practice hours
plot(data1, main = "Scatter Plot")

Fitting a Logistic Regression Model
fit = glm(Outcome ~ Practice, family = binomial(link = "logit"), data = data1)

# Plot predicted probabilities over the scatter plot
plot(data1, main = "Scatter Plot")
curve(predict(fit, data.frame(Practice = x), type = "response"), add = TRUE)
points(data1$Practice, fitted(fit), pch = 20)

The probability of success increases with practice hours. A positive β₁ indicates that the probability of success increases as X increases.
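
Because the log-odds are linear in Practice, exponentiating the slope gives an odds ratio, and predict() returns probabilities at any practice level. A sketch continuing from fit (the value of 10 hours is an arbitrary example):

# Odds ratio: multiplicative change in the odds of success per extra hour of practice
exp(coef(fit)["Practice"])

# Predicted probability of success at 10 hours of practice
predict(fit, data.frame(Practice = 10), type = "response")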

Conclusion
GLMs generalize linear regression to handle non-normal dependent variables and non-linear relationships.
Log-linear regression is useful for exponential relationships, log-normal distributions, and Poisson counts.
Binary logistic regression models probabilities for 0/1 outcomes using the logistic function.
R implementations are straightforward, and data transformations help resolve issues like negative predictions or non-linear growth.
Along with theory, we provided commented R code to implement these models on sample datasets.
We hope this article helps you understand and implement GLMs in real-world scenarios.

At Perceptive Analytics, our mission is “to enable businesses to unlock value in data.” For over 20 years, we’ve partnered with more than 100 clients, from Fortune 500 companies to mid-sized firms, to solve complex data analytics challenges. Our services include Tableau consulting services and AI consulting services, turning data into strategic insight. We would love to talk to you. Do reach out to us.
