peisen yan

Posted on May 20

Scatter Plots with Linear Regression in R (with Multi-Panel Layouts)

#r #tutorial #datascience #ecology

The Problem

You have environmental data (pH, organic matter, nutrients) and biodiversity data (Shannon index). You want to know: which environmental factors significantly affect diversity?

The quickest way to answer this is scatter plots + linear regression.

Here's a complete, runnable R workflow using real soil microbiome data — from a simple lm() call to a multi-panel publication-ready figure. Everything uses base R graphics, no ggplot2 required.

Data

This dataset has 24 soil samples with 6 environmental factors and 2 diversity indices:

Variable	Description	Unit
pH	Soil acidity	—
AK	Available potassium	mg/kg
AP	Available phosphorus	mg/kg
AN	Available nitrogen	mg/kg
OM	Organic matter	g/kg
W	Water content	%
Shannon	Species diversity	—
PD	Phylogenetic diversity	—

# Read and inspect
df <- read.csv("sandian.csv", header = TRUE)
str(df)
head(df)

Single Factor: pH → Shannon

x <- df$pH
y <- df$Shannon
fit <- lm(y ~ x)
summary(fit)

Multiple R-squared:  0.0004, p-value: 0.927

R² = 0.0004, p = 0.927 — pH has no significant linear relationship with Shannon diversity in this dataset. That's OK. Non-significant results are normal in real research.

Plot it:

par(mar = c(5, 5, 3, 1))
plot(x, y,
     xlab = "pH", ylab = "Shannon Index",
     pch = 16, cex = 1.5, col = "#00AFBB",
     cex.lab = 1.8, cex.axis = 1.3, las = 1)
abline(fit, lwd = 3, col = "#00AFBB")

Multi-Panel: All 6 Factors at Once

Doing one plot at a time is inefficient. Use layout() + for loop:

x_vars <- c("pH", "AK", "AP", "AN", "OM", "W")
x_labels <- c("pH", "AK (mg/kg)", "AP (mg/kg)",
              "AN (mg/kg)", "SOM (g/kg)", "W (%)")

layout(matrix(1:6, 2, 3, byrow = TRUE))
par(mar = c(4.5, 4.5, 3, 1))

for (i in seq_along(x_vars)) {
  fit <- lm(df$Shannon ~ df[[x_vars[i]]])
  s <- summary(fit)

  plot(df[[x_vars[i]]], df$Shannon,
       xlab = x_labels[i], ylab = "Shannon",
       pch = 16, cex = 1.3, col = "#00AFBB",
       cex.lab = 1.5, cex.axis = 1.2, las = 1)
  abline(fit, lwd = 3, col = "#00AFBB")

  # Annotate R² and p-value
  r2 <- round(s$r.squared, 3)
  p  <- s$coefficients[2, 4]
  p_label <- ifelse(p < 0.001, "p<0.001", paste0("p=", round(p, 3)))

  x_pos <- par("usr")[2] - diff(par("usr")[1:2]) * 0.25
  y_pos <- par("usr")[4] - diff(par("usr")[3:4]) * 0.1
  text(x_pos, y_pos, labels = bquote(R^2 == .(r2) ~ " " ~ .(p_label)),
       cex = 1.1, adj = c(1, 1))
}

Save at 300 dpi for publication:

png("scatter_6panel.png", width = 3000, height = 2000, res = 300)
# ... plotting code ...
dev.off()

Results at a Glance

Factor	R²	p-value	Significant?
pH	0.000	0.927	❌
AK	0.088	0.160	❌
AP	0.012	0.615	❌
AN	0.220	0.021	✅
OM	0.540	<0.001	✅✅
W	0.000	0.956	❌

Organic matter (OM) is the strongest predictor — it explains 54% of the variance in Shannon diversity.

Complete Script (Copy & Run)

# ================================================================
# Scatter Plots + Linear Regression: Environmental Factors vs. Diversity
# ================================================================

plot_one <- function(x, y, xlab, ylab) {
  fit <- lm(y ~ x)
  s <- summary(fit)
  r2 <- round(s$r.squared, 3)
  p  <- s$coefficients[2, 4]
  p_label <- ifelse(p < 0.001, "p<0.001", paste0("p=", round(p, 3)))

  plot(x, y, xlab = xlab, ylab = ylab,
       pch = 16, cex = 1.3, col = "#00AFBB",
       cex.lab = 1.5, cex.axis = 1.2, las = 1)
  abline(fit, lwd = 3, col = "#00AFBB")

  x_pos <- par("usr")[2] - diff(par("usr")[1:2]) * 0.25
  y_pos <- par("usr")[4] - diff(par("usr")[3:4]) * 0.1
  text(x_pos, y_pos, labels = bquote(R^2 == .(r2) ~ " " ~ .(p_label)),
       cex = 1.1, adj = c(1, 1))
}

df <- read.csv("sandian.csv", header = TRUE)

x_vars   <- c("pH", "AK", "AP", "AN", "OM", "W")
x_labels <- c("pH", "AK (mg/kg)", "AP (mg/kg)",
              "AN (mg/kg)", "SOM (g/kg)", "W (%)")

png("scatter_6panel.png", width = 3000, height = 2000, res = 300)
layout(matrix(1:6, 2, 3, byrow = TRUE))
par(mar = c(4.5, 4.5, 3, 1))

for (i in seq_along(x_vars)) {
  plot_one(df[[x_vars[i]]], df$Shannon, x_labels[i], "Shannon")
}
dev.off()

To adapt for your data: change x_vars to your column names, x_labels to your axis labels, and df$Shannon to your diversity index.

Parameter Cheat Sheet

param	what it does	common values
`pch`	point shape	16 (filled), 1 (empty), 17 (triangle)
`cex`	point/text size	1.3–1.5 (moderate)
`col`	color	`#00AFBB` (teal), `#E69F00` (orange)
`lty`	line type	1 (solid), 2 (dashed)
`lwd`	line width	2–3 (publication)
`res`	export dpi	300 (journal)

When R² Is Not Significant

Normal. Don't only show significant results.
Try non-linear: poly(x, 2) for quadratic terms.
Small n (< 30) : may lack statistical power, not ecological meaning.

*Follow me on Dev.to for more R tutorials.

Tags: r tutorial datascience ecology

DEV Community