DEV Community

peisen yan
peisen yan

Posted on

Scatter Plots with Linear Regression in R (with Multi-Panel Layouts)

The Problem

You have environmental data (pH, organic matter, nutrients) and biodiversity data (Shannon index). You want to know: which environmental factors significantly affect diversity?

The quickest way to answer this is scatter plots + linear regression.

Here's a complete, runnable R workflow using real soil microbiome data — from a simple lm() call to a multi-panel publication-ready figure. Everything uses base R graphics, no ggplot2 required.

Data

This dataset has 24 soil samples with 6 environmental factors and 2 diversity indices:

Variable Description Unit
pH Soil acidity
AK Available potassium mg/kg
AP Available phosphorus mg/kg
AN Available nitrogen mg/kg
OM Organic matter g/kg
W Water content %
Shannon Species diversity
PD Phylogenetic diversity
# Read and inspect
df <- read.csv("sandian.csv", header = TRUE)
str(df)
head(df)
Enter fullscreen mode Exit fullscreen mode

Single Factor: pH → Shannon

x <- df$pH
y <- df$Shannon
fit <- lm(y ~ x)
summary(fit)
Enter fullscreen mode Exit fullscreen mode
Multiple R-squared:  0.0004, p-value: 0.927
Enter fullscreen mode Exit fullscreen mode

R² = 0.0004, p = 0.927 — pH has no significant linear relationship with Shannon diversity in this dataset. That's OK. Non-significant results are normal in real research.

Plot it:

par(mar = c(5, 5, 3, 1))
plot(x, y,
     xlab = "pH", ylab = "Shannon Index",
     pch = 16, cex = 1.5, col = "#00AFBB",
     cex.lab = 1.8, cex.axis = 1.3, las = 1)
abline(fit, lwd = 3, col = "#00AFBB")
Enter fullscreen mode Exit fullscreen mode

Multi-Panel: All 6 Factors at Once

Doing one plot at a time is inefficient. Use layout() + for loop:

x_vars <- c("pH", "AK", "AP", "AN", "OM", "W")
x_labels <- c("pH", "AK (mg/kg)", "AP (mg/kg)",
              "AN (mg/kg)", "SOM (g/kg)", "W (%)")

layout(matrix(1:6, 2, 3, byrow = TRUE))
par(mar = c(4.5, 4.5, 3, 1))

for (i in seq_along(x_vars)) {
  fit <- lm(df$Shannon ~ df[[x_vars[i]]])
  s <- summary(fit)

  plot(df[[x_vars[i]]], df$Shannon,
       xlab = x_labels[i], ylab = "Shannon",
       pch = 16, cex = 1.3, col = "#00AFBB",
       cex.lab = 1.5, cex.axis = 1.2, las = 1)
  abline(fit, lwd = 3, col = "#00AFBB")

  # Annotate R² and p-value
  r2 <- round(s$r.squared, 3)
  p  <- s$coefficients[2, 4]
  p_label <- ifelse(p < 0.001, "p<0.001", paste0("p=", round(p, 3)))

  x_pos <- par("usr")[2] - diff(par("usr")[1:2]) * 0.25
  y_pos <- par("usr")[4] - diff(par("usr")[3:4]) * 0.1
  text(x_pos, y_pos, labels = bquote(R^2 == .(r2) ~ " " ~ .(p_label)),
       cex = 1.1, adj = c(1, 1))
}
Enter fullscreen mode Exit fullscreen mode

Save at 300 dpi for publication:

png("scatter_6panel.png", width = 3000, height = 2000, res = 300)
# ... plotting code ...
dev.off()
Enter fullscreen mode Exit fullscreen mode

Results at a Glance

Factor p-value Significant?
pH 0.000 0.927
AK 0.088 0.160
AP 0.012 0.615
AN 0.220 0.021
OM 0.540 <0.001 ✅✅
W 0.000 0.956

Organic matter (OM) is the strongest predictor — it explains 54% of the variance in Shannon diversity.

Complete Script (Copy & Run)

# ================================================================
# Scatter Plots + Linear Regression: Environmental Factors vs. Diversity
# ================================================================

plot_one <- function(x, y, xlab, ylab) {
  fit <- lm(y ~ x)
  s <- summary(fit)
  r2 <- round(s$r.squared, 3)
  p  <- s$coefficients[2, 4]
  p_label <- ifelse(p < 0.001, "p<0.001", paste0("p=", round(p, 3)))

  plot(x, y, xlab = xlab, ylab = ylab,
       pch = 16, cex = 1.3, col = "#00AFBB",
       cex.lab = 1.5, cex.axis = 1.2, las = 1)
  abline(fit, lwd = 3, col = "#00AFBB")

  x_pos <- par("usr")[2] - diff(par("usr")[1:2]) * 0.25
  y_pos <- par("usr")[4] - diff(par("usr")[3:4]) * 0.1
  text(x_pos, y_pos, labels = bquote(R^2 == .(r2) ~ " " ~ .(p_label)),
       cex = 1.1, adj = c(1, 1))
}

df <- read.csv("sandian.csv", header = TRUE)

x_vars   <- c("pH", "AK", "AP", "AN", "OM", "W")
x_labels <- c("pH", "AK (mg/kg)", "AP (mg/kg)",
              "AN (mg/kg)", "SOM (g/kg)", "W (%)")

png("scatter_6panel.png", width = 3000, height = 2000, res = 300)
layout(matrix(1:6, 2, 3, byrow = TRUE))
par(mar = c(4.5, 4.5, 3, 1))

for (i in seq_along(x_vars)) {
  plot_one(df[[x_vars[i]]], df$Shannon, x_labels[i], "Shannon")
}
dev.off()
Enter fullscreen mode Exit fullscreen mode

To adapt for your data: change x_vars to your column names, x_labels to your axis labels, and df$Shannon to your diversity index.

Parameter Cheat Sheet

param what it does common values
pch point shape 16 (filled), 1 (empty), 17 (triangle)
cex point/text size 1.3–1.5 (moderate)
col color #00AFBB (teal), #E69F00 (orange)
lty line type 1 (solid), 2 (dashed)
lwd line width 2–3 (publication)
res export dpi 300 (journal)

When R² Is Not Significant

  • Normal. Don't only show significant results.
  • Try non-linear: poly(x, 2) for quadratic terms.
  • Small n (< 30) : may lack statistical power, not ecological meaning.

*Follow me on Dev.to for more R tutorials.

Tags: r tutorial datascience ecology

Top comments (0)