The Problem
You have environmental data (pH, organic matter, nutrients) and biodiversity data (Shannon index). You want to know: which environmental factors significantly affect diversity?
The quickest way to answer this is scatter plots + linear regression.
Here's a complete, runnable R workflow using real soil microbiome data — from a simple lm() call to a multi-panel publication-ready figure. Everything uses base R graphics, no ggplot2 required.
Data
This dataset has 24 soil samples with 6 environmental factors and 2 diversity indices:
| Variable | Description | Unit |
|---|---|---|
| pH | Soil acidity | — |
| AK | Available potassium | mg/kg |
| AP | Available phosphorus | mg/kg |
| AN | Available nitrogen | mg/kg |
| OM | Organic matter | g/kg |
| W | Water content | % |
| Shannon | Species diversity | — |
| PD | Phylogenetic diversity | — |
# Read and inspect
df <- read.csv("sandian.csv", header = TRUE)
str(df)
head(df)
Single Factor: pH → Shannon
x <- df$pH
y <- df$Shannon
fit <- lm(y ~ x)
summary(fit)
Multiple R-squared: 0.0004, p-value: 0.927
R² = 0.0004, p = 0.927 — pH has no significant linear relationship with Shannon diversity in this dataset. That's OK. Non-significant results are normal in real research.
Plot it:
par(mar = c(5, 5, 3, 1))
plot(x, y,
xlab = "pH", ylab = "Shannon Index",
pch = 16, cex = 1.5, col = "#00AFBB",
cex.lab = 1.8, cex.axis = 1.3, las = 1)
abline(fit, lwd = 3, col = "#00AFBB")
Multi-Panel: All 6 Factors at Once
Doing one plot at a time is inefficient. Use layout() + for loop:
x_vars <- c("pH", "AK", "AP", "AN", "OM", "W")
x_labels <- c("pH", "AK (mg/kg)", "AP (mg/kg)",
"AN (mg/kg)", "SOM (g/kg)", "W (%)")
layout(matrix(1:6, 2, 3, byrow = TRUE))
par(mar = c(4.5, 4.5, 3, 1))
for (i in seq_along(x_vars)) {
fit <- lm(df$Shannon ~ df[[x_vars[i]]])
s <- summary(fit)
plot(df[[x_vars[i]]], df$Shannon,
xlab = x_labels[i], ylab = "Shannon",
pch = 16, cex = 1.3, col = "#00AFBB",
cex.lab = 1.5, cex.axis = 1.2, las = 1)
abline(fit, lwd = 3, col = "#00AFBB")
# Annotate R² and p-value
r2 <- round(s$r.squared, 3)
p <- s$coefficients[2, 4]
p_label <- ifelse(p < 0.001, "p<0.001", paste0("p=", round(p, 3)))
x_pos <- par("usr")[2] - diff(par("usr")[1:2]) * 0.25
y_pos <- par("usr")[4] - diff(par("usr")[3:4]) * 0.1
text(x_pos, y_pos, labels = bquote(R^2 == .(r2) ~ " " ~ .(p_label)),
cex = 1.1, adj = c(1, 1))
}
Save at 300 dpi for publication:
png("scatter_6panel.png", width = 3000, height = 2000, res = 300)
# ... plotting code ...
dev.off()
Results at a Glance
| Factor | R² | p-value | Significant? |
|---|---|---|---|
| pH | 0.000 | 0.927 | ❌ |
| AK | 0.088 | 0.160 | ❌ |
| AP | 0.012 | 0.615 | ❌ |
| AN | 0.220 | 0.021 | ✅ |
| OM | 0.540 | <0.001 | ✅✅ |
| W | 0.000 | 0.956 | ❌ |
Organic matter (OM) is the strongest predictor — it explains 54% of the variance in Shannon diversity.
Complete Script (Copy & Run)
# ================================================================
# Scatter Plots + Linear Regression: Environmental Factors vs. Diversity
# ================================================================
plot_one <- function(x, y, xlab, ylab) {
fit <- lm(y ~ x)
s <- summary(fit)
r2 <- round(s$r.squared, 3)
p <- s$coefficients[2, 4]
p_label <- ifelse(p < 0.001, "p<0.001", paste0("p=", round(p, 3)))
plot(x, y, xlab = xlab, ylab = ylab,
pch = 16, cex = 1.3, col = "#00AFBB",
cex.lab = 1.5, cex.axis = 1.2, las = 1)
abline(fit, lwd = 3, col = "#00AFBB")
x_pos <- par("usr")[2] - diff(par("usr")[1:2]) * 0.25
y_pos <- par("usr")[4] - diff(par("usr")[3:4]) * 0.1
text(x_pos, y_pos, labels = bquote(R^2 == .(r2) ~ " " ~ .(p_label)),
cex = 1.1, adj = c(1, 1))
}
df <- read.csv("sandian.csv", header = TRUE)
x_vars <- c("pH", "AK", "AP", "AN", "OM", "W")
x_labels <- c("pH", "AK (mg/kg)", "AP (mg/kg)",
"AN (mg/kg)", "SOM (g/kg)", "W (%)")
png("scatter_6panel.png", width = 3000, height = 2000, res = 300)
layout(matrix(1:6, 2, 3, byrow = TRUE))
par(mar = c(4.5, 4.5, 3, 1))
for (i in seq_along(x_vars)) {
plot_one(df[[x_vars[i]]], df$Shannon, x_labels[i], "Shannon")
}
dev.off()
To adapt for your data: change x_vars to your column names, x_labels to your axis labels, and df$Shannon to your diversity index.
Parameter Cheat Sheet
| param | what it does | common values |
|---|---|---|
pch |
point shape | 16 (filled), 1 (empty), 17 (triangle) |
cex |
point/text size | 1.3–1.5 (moderate) |
col |
color |
#00AFBB (teal), #E69F00 (orange) |
lty |
line type | 1 (solid), 2 (dashed) |
lwd |
line width | 2–3 (publication) |
res |
export dpi | 300 (journal) |
When R² Is Not Significant
- Normal. Don't only show significant results.
-
Try non-linear:
poly(x, 2)for quadratic terms. - Small n (< 30) : may lack statistical power, not ecological meaning.
*Follow me on Dev.to for more R tutorials.
Tags: r tutorial datascience ecology



Top comments (0)