DEV Community

peisen yan
peisen yan

Posted on

Correlation Analysis in R: From Pearson to Spearman with corrplot

When studying how environmental factors shape biodiversity, correlation analysis is usually the first stop. But which method should you use — Pearson, Spearman, or Kendall? Pick wrong and you might draw misleading conclusions.

This tutorial walks through the complete workflow using real-world ecological data (50 sites × 12 variables), covering method selection, correlation computation, significance testing, and corrplot visualization at publication quality.


Quick Reference: Which Method to Use?

Method Normality Required Relationship Outlier Sensitivity Best For
Pearson Yes Linear Sensitive Normally distributed continuous data
Spearman No Monotonic Robust Default for ecological data
Kendall No Monotonic Robust Small samples (n < 30) with many ties

Rule of thumb: Run Shapiro-Wilk first. If all variables are normal → Pearson. Otherwise → Spearman (most common in ecology).


Data Overview

The dataset contains 50 sampling sites with 12 variables:

Environmental factors (6): pH, TN (total nitrogen), TP (total phosphorus), SOM (soil organic matter), MAP (mean annual precipitation), MAT (mean annual temperature)

Diversity indices (6): PD (phylogenetic diversity), NRI (net relatedness index), NTI (nearest taxon index), Shannon-Wiener index, Simpson index, S (species richness)

# Load data
df <- read.csv("alpha_diversity.csv", header = TRUE)
head(df)
Enter fullscreen mode Exit fullscreen mode

The data file (alpha_diversity.csv) contains 50 rows and 12 numeric columns — one row per sampling site, one column per variable.


Step 1: Normality Test

Before choosing a correlation method, test whether your data follows a normal distribution:

# Shapiro-Wilk normality test for all variables
p_values <- sapply(df, function(x) shapiro.test(x)$p.value)
norm_table <- data.frame(
  Variable = names(df),
  p_value = round(p_values, 4),
  Normal = ifelse(p_values > 0.05, "Yes", "No")
)
print(norm_table)
Enter fullscreen mode Exit fullscreen mode

In our dataset, most environmental factors and diversity indices have p < 0.05 (non-normal). This means Spearman is the appropriate choice.


Step 2: Calculate Correlation Coefficients

R's cor() function computes all three methods in one line:

# All three methods at once
pearson_r  <- cor(df, method = "pearson")
spearman_r <- cor(df, method = "spearman")
kendall_r  <- cor(df, method = "kendall")
Enter fullscreen mode Exit fullscreen mode

The sign and relative magnitude are broadly consistent across methods. When a relationship is significant under Spearman but not Pearson, it often indicates a non-linear monotonic relationship — exactly where Spearman excels.


Step 3: Get Coefficients and p-values Together

cor() only returns the correlation matrix. For significance tests, use rcorr() from the Hmisc package:

library(Hmisc)
res <- rcorr(as.matrix(df), type = "spearman")

res$r  # Correlation coefficients
res$P  # p-values
Enter fullscreen mode Exit fullscreen mode

Example interpretation (pH vs. Shannon):

  • Correlation coefficient r = 0.54
  • p < 0.001 (highly significant)
  • Interpretation: Higher pH is associated with higher Shannon diversity — soil acidity is an important driver of community diversity.
# Export results
write.csv(round(res$r, 3), "spearman_corr_coef.csv")
write.csv(round(res$P, 4), "spearman_p_values.csv")
Enter fullscreen mode Exit fullscreen mode

Step 4: corrplot Visualization

Let's build up from a basic plot to a publication-ready figure. All examples use corrplot.

4.1 Basic corrplot — One Line

library(corrplot)
corrplot(res$r)
Enter fullscreen mode Exit fullscreen mode

This gives you a colored circle matrix out of the box — blue for positive, red for negative, color intensity reflects strength. It's useful for a quick look but not publication-ready (no labels, no sorting).

4.2 Custom Colors + Upper/Lower Triangular Mix

Three improvements:

  1. order = "AOE" — auto-cluster by angular order of eigenvectors
  2. Upper triangle: colored circles; Lower triangle: numbers
  3. Custom blue-white-red color palette
corrplot(res$r, order = "AOE", type = "upper", tl.pos = "d",
         tl.col = "black", tl.cex = 0.8,
         col = colorRampPalette(c("#0072B5", "white", "#BC3C29"))(200))
corrplot(res$r, add = TRUE, type = "lower", method = "number",
         order = "AOE", diag = FALSE, tl.pos = "n", cl.pos = "n",
         col = "black", number.cex = 0.7)
Enter fullscreen mode Exit fullscreen mode

Key parameters:

  • order = "AOE": Reorders variables to reveal clustered structure
  • add = TRUE: Overlays the second layer (numbers) on top of the first (circles)
  • tl.pos = "d": Places variable labels on the diagonal to avoid overlap

4.3 Publication Standard: Significance Star Annotation ⭐

This is the format most commonly seen in peer-reviewed papers — all correlations shown, significant ones marked with stars (* p < 0.05, ** p < 0.01, *** p < 0.001):

p_mat <- res$P
diag(p_mat) <- 1
order_aoe <- corrMatOrder(res$r, order = "AOE")

corrplot(res$r[order_aoe, order_aoe], type = "upper", tl.pos = "d",
         tl.col = "black", tl.cex = 0.8,
         col = colorRampPalette(c("#0072B5", "white", "#BC3C29"))(200),
         p.mat = p_mat[order_aoe, order_aoe],
         sig.level = c(0.001, 0.01, 0.05),
         insig = "label_sig",
         pch.cex = 1.0, pch.col = "black")
corrplot(res$r[order_aoe, order_aoe], add = TRUE, type = "lower",
         method = "number", diag = FALSE, tl.pos = "n", cl.pos = "n",
         col = "black", number.cex = 0.7)
Enter fullscreen mode Exit fullscreen mode

Critical details:

  • sig.level must be in ascending order (0.001, 0.01, 0.05)
  • insig = "label_sig" annotates significant correlations with stars
  • p.mat must use the same AOE ordering as the correlation matrix to keep upper and lower triangles aligned
  • Pre-compute order_aoe with corrMatOrder() and apply it to both res$r and p_mat

Step 5: Reading the Results

The corrplot reveals clear patterns in environment–diversity relationships:

Environment vs. Diversity:

  • pH positively correlated with Shannon (r = 0.54, p < 0.001), Simpson (r = 0.46, p < 0.001), and S (r = 0.44, p < 0.01) — near-neutral to alkaline soils support higher diversity
  • TN shows consistent positive correlations with diversity indices (r ≈ 0.43–0.53) — nitrogen availability is another key driver
  • SOM only weakly correlates with PD (r = 0.32, p < 0.05)

Among diversity indices:

  • Shannon and Simpson are strongly correlated (r = 0.85) — they capture overlapping information
  • Shannon and S also highly correlated (r = 0.74) — Shannon is inherently influenced by richness
  • NRI and NTI moderately correlated (r = 0.53) — both reflect phylogenetic structure

Method note: Comparing Pearson and Spearman results, most significant relationships agree in direction, but Spearman's rank correlation is more robust when normality assumptions are violated. Spearman is recommended as the default in ecological studies.


Complete Runnable Script

# ============================================================
# Complete Correlation Analysis: Pearson / Spearman / Kendall
# ============================================================
# First run: install.packages(c("Hmisc", "corrplot"))

library(Hmisc)
library(corrplot)

# --- Load data ---
df <- read.csv("alpha_diversity.csv", header = TRUE)

# --- Normality test ---
p_vals <- sapply(df, function(x) shapiro.test(x)$p.value)
data.frame(Variable = names(df), p_value = round(p_vals, 4),
           Normal = ifelse(p_vals > 0.05, "Yes", "No"))

# --- Three correlation methods ---
cor(df, method = "pearson")
cor(df, method = "spearman")
cor(df, method = "kendall")

# --- Spearman with p-values ---
res <- rcorr(as.matrix(df), type = "spearman")
res$r
res$P

# --- Visualization ---

# 1) Basic
pdf("corrplot_basic.pdf", width = 10, height = 8)
corrplot(res$r)
dev.off()

# 2) Custom colors + upper/lower mix
pdf("corrplot_color.pdf", width = 10, height = 8)
corrplot(res$r, order = "AOE", type = "upper", tl.pos = "d",
         tl.col = "black", tl.cex = 0.8,
         col = colorRampPalette(c("#0072B5", "white", "#BC3C29"))(200))
corrplot(res$r, add = TRUE, type = "lower", method = "number",
         order = "AOE", diag = FALSE, tl.pos = "n", cl.pos = "n",
         col = "black", number.cex = 0.7)
dev.off()

# 3) Publication standard with significance stars
p_mat <- res$P
diag(p_mat) <- 1
order_aoe <- corrMatOrder(res$r, order = "AOE")
pdf("corrplot_significance.pdf", width = 10, height = 8)
corrplot(res$r[order_aoe, order_aoe], type = "upper", tl.pos = "d",
         tl.col = "black", tl.cex = 0.8,
         col = colorRampPalette(c("#0072B5", "white", "#BC3C29"))(200),
         p.mat = p_mat[order_aoe, order_aoe],
         sig.level = c(0.001, 0.01, 0.05),
         insig = "label_sig",
         pch.cex = 1.0, pch.col = "black")
corrplot(res$r[order_aoe, order_aoe], add = TRUE, type = "lower",
         method = "number", diag = FALSE, tl.pos = "n", cl.pos = "n",
         col = "black", number.cex = 0.7)
dev.off()
Enter fullscreen mode Exit fullscreen mode

To use with your own data: Prepare a CSV file where each column is a numeric variable and each row is an observation. Replace "alpha_diversity.csv" in the read.csv() call with your file name.


Key Takeaways

  • Check normality first — Shapiro-Wilk test tells you which correlation method to use
  • Spearman is the safe default for ecological and environmental data
  • corrplot with order = "AOE" reveals clustered structure automatically
  • Significance star annotations make your figure publication-ready
  • Correlation ≠ causation — even r = 0.9 doesn't mean A causes B
  • Multiple testing — with 12 variables (66 pairs), some significant results may be random. Consider Bonferroni or FDR correction for strict inference

Need help with R data analysis? I provide data analysis, visualization, and publication-ready figure services.

Follow me on Dev.to for more R tutorials. GitHub and project links available on request.

Let me know in the comments if you have questions or want the example dataset!

Top comments (0)