Cluster Analysis in R: Hierarchical Clustering & K-means
Samples with similar species composition often reflect similar environmental conditions. Cluster analysis quantifies this similarity and helps ecologists understand community structure.
This tutorial demonstrates a complete workflow from distance calculation to hierarchical clustering and K-means clustering using simulated community data (30 plots × 20 species).
Data Preparation
The dataset contains 30 plots and 20 species, designed with 3 community types (10 plots each):
| Type | Plots | Dominant Species | Characteristic |
|---|---|---|---|
| Type A | Plot01-10 | Sp01-Sp08 | Higher abundance of species 1-8 |
| Type B | Plot11-20 | Sp09-Sp15 | Higher abundance of species 9-15 |
| Type C | Plot21-30 | Sp16-Sp20 | Higher abundance of species 16-20 |
# Read data
df <- read.csv("community_data.csv", header = TRUE, row.names = 1)
head(df[, 1:6])
Hierarchical Clustering
Step 1: Calculate Distance Matrix
For species abundance data, Jaccard distance or Bray-Curtis distance are most commonly used.
library(vegan)
dist_jac <- vegdist(df, method = "jaccard")
Step 2: UPGMA Clustering
hc_upgma <- hclust(dist_jac, method = "average")
method = "average" is UPGMA (Unweighted Pair Group Method with Arithmetic Mean). Other options:
| Method | Description |
|---|---|
"average" |
UPGMA, most common, balanced branches |
"ward.D2" |
Ward's minimum variance, compact clusters |
"complete" |
Complete linkage, based on farthest points |
"single" |
Single linkage, based on nearest points |
Step 3: Plot Dendrogram
# Vertical dendrogram
par(mar = c(4, 4, 3, 8))
plot(hc_upgma, hang = -1, main = "Hierarchical Clustering (UPGMA, Jaccard)",
xlab = "Sample", ylab = "Jaccard Distance", sub = "", cex = 0.7)
rect.hclust(hc_upgma, k = 3, border = c("#E64B35", "#4DBBD5", "#00A087"))
Parameter notes:
-
hang = -1: Aligns all branch tips to the same horizontal line -
rect.hclust(k = 3): Draws rectangles around 3 clusters -
cex = 0.7: Adjusts label font size
# Horizontal dendrogram (better for many samples)
plot(as.dendrogram(hc_upgma), horiz = TRUE,
main = "Hierarchical Clustering (UPGMA, Jaccard)")
Step 4: Cut the Tree
hc_clusters <- cutree(hc_upgma, k = 3)
print(table(hc_clusters))
cutree() cuts the dendrogram at a specified height and returns cluster membership.
K-means Clustering
Hierarchical clustering doesn't require pre-specifying the number of clusters, but when you have a prior estimate, K-means is faster.
Step 1: Dimensionality Reduction
K-means performs poorly on high-dimensional species data. Reduce dimensions first:
dist_bc <- vegdist(df, method = "bray")
pcoa <- cmdscale(dist_bc, k = 4, eig = TRUE)
Step 2: K-means
set.seed(123)
km <- kmeans(pcoa$points, centers = 3, nstart = 50)
km$cluster
Parameters:
-
centers = 3: Number of clusters -
nstart = 50: 50 random initializations, takes the best result -
set.seed(123): Ensures reproducibility
Step 3: Visualization
library(factoextra)
fviz_cluster(km, data = pcoa$points,
ellipse.type = "norm",
palette = c("#E64B35", "#4DBBD5", "#00A087"),
ggtheme = theme_bw(),
main = "K-means Clustering (k = 3)",
labelsize = 0) +
labs(color = "Cluster", shape = "Cluster")
fviz_cluster() projects data onto the first two principal coordinates, with ellipse.type = "norm" adding 95% confidence ellipses.
Results
The dendrogram and K-means plot clearly show:
Three distinct community types — The three main branches in the dendrogram correspond to the three simulated community types.
UPGMA and K-means agree — Both methods produce highly consistent classification results, approaching the problem from different angles.
-
Applications:
- Classifying community types based on species composition
- Validating predefined habitat/treatment groupings
- Providing categorical variables for downstream ordination (NMDS, RDA)
Complete Runable Script
# ============================================================
# Complete workflow: Hierarchical + K-means clustering
# ============================================================
# First run: install.packages(c("vegan", "factoextra"))
library(vegan)
library(factoextra)
df <- read.csv("community_data.csv", header = TRUE, row.names = 1)
# Hierarchical clustering
dist_jac <- vegdist(df, method = "jaccard")
hc_upgma <- hclust(dist_jac, method = "average")
pdf("dendrogram.pdf", width = 12, height = 7)
par(mar = c(4, 4, 3, 8))
plot(hc_upgma, hang = -1, main = "Hierarchical Clustering (UPGMA, Jaccard)",
xlab = "Sample", ylab = "Jaccard Distance", sub = "", cex = 0.7)
rect.hclust(hc_upgma, k = 3, border = c("#E64B35", "#4DBBD5", "#00A087"))
dev.off()
pdf("dendrogram_horizontal.pdf", width = 8, height = 10)
par(mar = c(4, 4, 2, 2))
plot(as.dendrogram(hc_upgma), horiz = TRUE,
main = "Hierarchical Clustering (UPGMA, Jaccard)")
dev.off()
hc_clusters <- cutree(hc_upgma, k = 3)
# K-means
dist_bc <- vegdist(df, method = "bray")
pcoa <- cmdscale(dist_bc, k = 4, eig = TRUE)
set.seed(123)
km <- kmeans(pcoa$points, centers = 3, nstart = 50)
pdf("kmeans_clusters.pdf", width = 8, height = 6)
fviz_cluster(km, data = pcoa$points,
ellipse.type = "norm",
palette = c("#E64B35", "#4DBBD5", "#00A087"),
ggtheme = theme_bw(),
main = "K-means Clustering (k = 3)",
labelsize = 0) +
labs(color = "Cluster", shape = "Cluster")
dev.off()
# Compare results
data.frame(Sample = rownames(df),
Hierarchical = paste0("G", hc_clusters),
Kmeans = paste0("G", km$cluster))
To use with your own data:
- Prepare a CSV file (rows = samples, columns = species, values = abundance)
- Replace
community_data.csvwith your file name - Adjust
k = 3to your expected number of clusters
Quick Tips
- Use Jaccard distance for presence/absence or abundance data with many zeros
- Use Bray-Curtis for raw abundance data
- Always set
nstart >= 50for K-means to avoid local optima - Cross-validate clustering results with two different methods
- Avoid pure numbers as sample names (use "Plot1" instead of "1")
Follow me on Dev.to for more R tutorials.




Top comments (0)