peisen yan

Posted on May 31

Correlation Analysis in R: From Pearson to Spearman with corrplot

#r #datascience #tutorial #dataviz

When studying how environmental factors shape biodiversity, correlation analysis is usually the first stop. But which method should you use — Pearson, Spearman, or Kendall? Pick wrong and you might draw misleading conclusions.

This tutorial walks through the complete workflow using real-world ecological data (50 sites × 12 variables), covering method selection, correlation computation, significance testing, and corrplot visualization at publication quality.

Quick Reference: Which Method to Use?

Method	Normality Required	Relationship	Outlier Sensitivity	Best For
Pearson	Yes	Linear	Sensitive	Normally distributed continuous data
Spearman	No	Monotonic	Robust	Default for ecological data ✅
Kendall	No	Monotonic	Robust	Small samples (n < 30) with many ties

Rule of thumb: Run Shapiro-Wilk first. If all variables are normal → Pearson. Otherwise → Spearman (most common in ecology).

Data Overview

The dataset contains 50 sampling sites with 12 variables:

Environmental factors (6): pH, TN (total nitrogen), TP (total phosphorus), SOM (soil organic matter), MAP (mean annual precipitation), MAT (mean annual temperature)

Diversity indices (6): PD (phylogenetic diversity), NRI (net relatedness index), NTI (nearest taxon index), Shannon-Wiener index, Simpson index, S (species richness)

# Load data
df <- read.csv("alpha_diversity.csv", header = TRUE)
head(df)

The data file (alpha_diversity.csv) contains 50 rows and 12 numeric columns — one row per sampling site, one column per variable.

Step 1: Normality Test

Before choosing a correlation method, test whether your data follows a normal distribution:

# Shapiro-Wilk normality test for all variables
p_values <- sapply(df, function(x) shapiro.test(x)$p.value)
norm_table <- data.frame(
  Variable = names(df),
  p_value = round(p_values, 4),
  Normal = ifelse(p_values > 0.05, "Yes", "No")
)
print(norm_table)

In our dataset, most environmental factors and diversity indices have p < 0.05 (non-normal). This means Spearman is the appropriate choice.

Step 2: Calculate Correlation Coefficients

R's cor() function computes all three methods in one line:

# All three methods at once
pearson_r  <- cor(df, method = "pearson")
spearman_r <- cor(df, method = "spearman")
kendall_r  <- cor(df, method = "kendall")

The sign and relative magnitude are broadly consistent across methods. When a relationship is significant under Spearman but not Pearson, it often indicates a non-linear monotonic relationship — exactly where Spearman excels.

Step 3: Get Coefficients and p-values Together

cor() only returns the correlation matrix. For significance tests, use rcorr() from the Hmisc package:

library(Hmisc)
res <- rcorr(as.matrix(df), type = "spearman")

res$r  # Correlation coefficients
res$P  # p-values

Example interpretation (pH vs. Shannon):

Correlation coefficient r = 0.54
p < 0.001 (highly significant)
Interpretation: Higher pH is associated with higher Shannon diversity — soil acidity is an important driver of community diversity.

# Export results
write.csv(round(res$r, 3), "spearman_corr_coef.csv")
write.csv(round(res$P, 4), "spearman_p_values.csv")

Step 4: corrplot Visualization

Let's build up from a basic plot to a publication-ready figure. All examples use corrplot.

4.1 Basic corrplot — One Line

library(corrplot)
corrplot(res$r)

This gives you a colored circle matrix out of the box — blue for positive, red for negative, color intensity reflects strength. It's useful for a quick look but not publication-ready (no labels, no sorting).

4.2 Custom Colors + Upper/Lower Triangular Mix

Three improvements:

order = "AOE" — auto-cluster by angular order of eigenvectors
Upper triangle: colored circles; Lower triangle: numbers
Custom blue-white-red color palette

corrplot(res$r, order = "AOE", type = "upper", tl.pos = "d",
         tl.col = "black", tl.cex = 0.8,
         col = colorRampPalette(c("#0072B5", "white", "#BC3C29"))(200))
corrplot(res$r, add = TRUE, type = "lower", method = "number",
         order = "AOE", diag = FALSE, tl.pos = "n", cl.pos = "n",
         col = "black", number.cex = 0.7)

Key parameters:

order = "AOE": Reorders variables to reveal clustered structure
add = TRUE: Overlays the second layer (numbers) on top of the first (circles)
tl.pos = "d": Places variable labels on the diagonal to avoid overlap

4.3 Publication Standard: Significance Star Annotation ⭐

This is the format most commonly seen in peer-reviewed papers — all correlations shown, significant ones marked with stars (* p < 0.05, ** p < 0.01, *** p < 0.001):

p_mat <- res$P
diag(p_mat) <- 1
order_aoe <- corrMatOrder(res$r, order = "AOE")

corrplot(res$r[order_aoe, order_aoe], type = "upper", tl.pos = "d",
         tl.col = "black", tl.cex = 0.8,
         col = colorRampPalette(c("#0072B5", "white", "#BC3C29"))(200),
         p.mat = p_mat[order_aoe, order_aoe],
         sig.level = c(0.001, 0.01, 0.05),
         insig = "label_sig",
         pch.cex = 1.0, pch.col = "black")
corrplot(res$r[order_aoe, order_aoe], add = TRUE, type = "lower",
         method = "number", diag = FALSE, tl.pos = "n", cl.pos = "n",
         col = "black", number.cex = 0.7)

Critical details:

sig.level must be in ascending order (0.001, 0.01, 0.05)
insig = "label_sig" annotates significant correlations with stars
p.mat must use the same AOE ordering as the correlation matrix to keep upper and lower triangles aligned
Pre-compute order_aoe with corrMatOrder() and apply it to both res$r and p_mat

Step 5: Reading the Results

The corrplot reveals clear patterns in environment–diversity relationships:

Environment vs. Diversity:

pH positively correlated with Shannon (r = 0.54, p < 0.001), Simpson (r = 0.46, p < 0.001), and S (r = 0.44, p < 0.01) — near-neutral to alkaline soils support higher diversity
TN shows consistent positive correlations with diversity indices (r ≈ 0.43–0.53) — nitrogen availability is another key driver
SOM only weakly correlates with PD (r = 0.32, p < 0.05)

Among diversity indices:

Shannon and Simpson are strongly correlated (r = 0.85) — they capture overlapping information
Shannon and S also highly correlated (r = 0.74) — Shannon is inherently influenced by richness
NRI and NTI moderately correlated (r = 0.53) — both reflect phylogenetic structure

Method note: Comparing Pearson and Spearman results, most significant relationships agree in direction, but Spearman's rank correlation is more robust when normality assumptions are violated. Spearman is recommended as the default in ecological studies.

Complete Runnable Script

# ============================================================
# Complete Correlation Analysis: Pearson / Spearman / Kendall
# ============================================================
# First run: install.packages(c("Hmisc", "corrplot"))

library(Hmisc)
library(corrplot)

# --- Load data ---
df <- read.csv("alpha_diversity.csv", header = TRUE)

# --- Normality test ---
p_vals <- sapply(df, function(x) shapiro.test(x)$p.value)
data.frame(Variable = names(df), p_value = round(p_vals, 4),
           Normal = ifelse(p_vals > 0.05, "Yes", "No"))

# --- Three correlation methods ---
cor(df, method = "pearson")
cor(df, method = "spearman")
cor(df, method = "kendall")

# --- Spearman with p-values ---
res <- rcorr(as.matrix(df), type = "spearman")
res$r
res$P

# --- Visualization ---

# 1) Basic
pdf("corrplot_basic.pdf", width = 10, height = 8)
corrplot(res$r)
dev.off()

# 2) Custom colors + upper/lower mix
pdf("corrplot_color.pdf", width = 10, height = 8)
corrplot(res$r, order = "AOE", type = "upper", tl.pos = "d",
         tl.col = "black", tl.cex = 0.8,
         col = colorRampPalette(c("#0072B5", "white", "#BC3C29"))(200))
corrplot(res$r, add = TRUE, type = "lower", method = "number",
         order = "AOE", diag = FALSE, tl.pos = "n", cl.pos = "n",
         col = "black", number.cex = 0.7)
dev.off()

# 3) Publication standard with significance stars
p_mat <- res$P
diag(p_mat) <- 1
order_aoe <- corrMatOrder(res$r, order = "AOE")
pdf("corrplot_significance.pdf", width = 10, height = 8)
corrplot(res$r[order_aoe, order_aoe], type = "upper", tl.pos = "d",
         tl.col = "black", tl.cex = 0.8,
         col = colorRampPalette(c("#0072B5", "white", "#BC3C29"))(200),
         p.mat = p_mat[order_aoe, order_aoe],
         sig.level = c(0.001, 0.01, 0.05),
         insig = "label_sig",
         pch.cex = 1.0, pch.col = "black")
corrplot(res$r[order_aoe, order_aoe], add = TRUE, type = "lower",
         method = "number", diag = FALSE, tl.pos = "n", cl.pos = "n",
         col = "black", number.cex = 0.7)
dev.off()

To use with your own data: Prepare a CSV file where each column is a numeric variable and each row is an observation. Replace "alpha_diversity.csv" in the read.csv() call with your file name.

Key Takeaways

Check normality first — Shapiro-Wilk test tells you which correlation method to use
Spearman is the safe default for ecological and environmental data
corrplot with order = "AOE" reveals clustered structure automatically
Significance star annotations make your figure publication-ready
Correlation ≠ causation — even r = 0.9 doesn't mean A causes B
Multiple testing — with 12 variables (66 pairs), some significant results may be random. Consider Bonferroni or FDR correction for strict inference

Need help with R data analysis? I provide data analysis, visualization, and publication-ready figure services.

Follow me on Dev.to for more R tutorials. GitHub and project links available on request.

Let me know in the comments if you have questions or want the example dataset!

DEV Community