peisen yan

Posted on Jun 10

Cluster Analysis in R: Hierarchical Clustering & K-means

#r #tutorial #datascience #ecology

Cluster Analysis in R: Hierarchical Clustering & K-means

Samples with similar species composition often reflect similar environmental conditions. Cluster analysis quantifies this similarity and helps ecologists understand community structure.

This tutorial demonstrates a complete workflow from distance calculation to hierarchical clustering and K-means clustering using simulated community data (30 plots × 20 species).

Data Preparation

The dataset contains 30 plots and 20 species, designed with 3 community types (10 plots each):

Type	Plots	Dominant Species	Characteristic
Type A	Plot01-10	Sp01-Sp08	Higher abundance of species 1-8
Type B	Plot11-20	Sp09-Sp15	Higher abundance of species 9-15
Type C	Plot21-30	Sp16-Sp20	Higher abundance of species 16-20

# Read data
df <- read.csv("community_data.csv", header = TRUE, row.names = 1)
head(df[, 1:6])

Hierarchical Clustering

Step 1: Calculate Distance Matrix

For species abundance data, Jaccard distance or Bray-Curtis distance are most commonly used.

library(vegan)
dist_jac <- vegdist(df, method = "jaccard")

Step 2: UPGMA Clustering

hc_upgma <- hclust(dist_jac, method = "average")

method = "average" is UPGMA (Unweighted Pair Group Method with Arithmetic Mean). Other options:

Method	Description
`"average"`	UPGMA, most common, balanced branches
`"ward.D2"`	Ward's minimum variance, compact clusters
`"complete"`	Complete linkage, based on farthest points
`"single"`	Single linkage, based on nearest points

Step 3: Plot Dendrogram

# Vertical dendrogram
par(mar = c(4, 4, 3, 8))
plot(hc_upgma, hang = -1, main = "Hierarchical Clustering (UPGMA, Jaccard)",
     xlab = "Sample", ylab = "Jaccard Distance", sub = "", cex = 0.7)
rect.hclust(hc_upgma, k = 3, border = c("#E64B35", "#4DBBD5", "#00A087"))

Parameter notes:

hang = -1: Aligns all branch tips to the same horizontal line
rect.hclust(k = 3): Draws rectangles around 3 clusters
cex = 0.7: Adjusts label font size

# Horizontal dendrogram (better for many samples)
plot(as.dendrogram(hc_upgma), horiz = TRUE,
     main = "Hierarchical Clustering (UPGMA, Jaccard)")

Step 4: Cut the Tree

hc_clusters <- cutree(hc_upgma, k = 3)
print(table(hc_clusters))

cutree() cuts the dendrogram at a specified height and returns cluster membership.

K-means Clustering

Hierarchical clustering doesn't require pre-specifying the number of clusters, but when you have a prior estimate, K-means is faster.

Step 1: Dimensionality Reduction

K-means performs poorly on high-dimensional species data. Reduce dimensions first:

dist_bc <- vegdist(df, method = "bray")
pcoa <- cmdscale(dist_bc, k = 4, eig = TRUE)

Step 2: K-means

set.seed(123)
km <- kmeans(pcoa$points, centers = 3, nstart = 50)
km$cluster

Parameters:

centers = 3: Number of clusters
nstart = 50: 50 random initializations, takes the best result
set.seed(123): Ensures reproducibility

Step 3: Visualization

library(factoextra)
fviz_cluster(km, data = pcoa$points,
             ellipse.type = "norm",
             palette = c("#E64B35", "#4DBBD5", "#00A087"),
             ggtheme = theme_bw(),
             main = "K-means Clustering (k = 3)",
             labelsize = 0) +
  labs(color = "Cluster", shape = "Cluster")

fviz_cluster() projects data onto the first two principal coordinates, with ellipse.type = "norm" adding 95% confidence ellipses.

Results

The dendrogram and K-means plot clearly show:

Three distinct community types — The three main branches in the dendrogram correspond to the three simulated community types.
UPGMA and K-means agree — Both methods produce highly consistent classification results, approaching the problem from different angles.
Applications:
- Classifying community types based on species composition
- Validating predefined habitat/treatment groupings
- Providing categorical variables for downstream ordination (NMDS, RDA)

Complete Runable Script

# ============================================================
# Complete workflow: Hierarchical + K-means clustering
# ============================================================
# First run: install.packages(c("vegan", "factoextra"))

library(vegan)
library(factoextra)

df <- read.csv("community_data.csv", header = TRUE, row.names = 1)

# Hierarchical clustering
dist_jac <- vegdist(df, method = "jaccard")
hc_upgma <- hclust(dist_jac, method = "average")

pdf("dendrogram.pdf", width = 12, height = 7)
par(mar = c(4, 4, 3, 8))
plot(hc_upgma, hang = -1, main = "Hierarchical Clustering (UPGMA, Jaccard)",
     xlab = "Sample", ylab = "Jaccard Distance", sub = "", cex = 0.7)
rect.hclust(hc_upgma, k = 3, border = c("#E64B35", "#4DBBD5", "#00A087"))
dev.off()

pdf("dendrogram_horizontal.pdf", width = 8, height = 10)
par(mar = c(4, 4, 2, 2))
plot(as.dendrogram(hc_upgma), horiz = TRUE,
     main = "Hierarchical Clustering (UPGMA, Jaccard)")
dev.off()

hc_clusters <- cutree(hc_upgma, k = 3)

# K-means
dist_bc <- vegdist(df, method = "bray")
pcoa <- cmdscale(dist_bc, k = 4, eig = TRUE)
set.seed(123)
km <- kmeans(pcoa$points, centers = 3, nstart = 50)

pdf("kmeans_clusters.pdf", width = 8, height = 6)
fviz_cluster(km, data = pcoa$points,
             ellipse.type = "norm",
             palette = c("#E64B35", "#4DBBD5", "#00A087"),
             ggtheme = theme_bw(),
             main = "K-means Clustering (k = 3)",
             labelsize = 0) +
  labs(color = "Cluster", shape = "Cluster")
dev.off()

# Compare results
data.frame(Sample = rownames(df),
           Hierarchical = paste0("G", hc_clusters),
           Kmeans = paste0("G", km$cluster))

To use with your own data:

Prepare a CSV file (rows = samples, columns = species, values = abundance)
Replace community_data.csv with your file name
Adjust k = 3 to your expected number of clusters

Quick Tips

Use Jaccard distance for presence/absence or abundance data with many zeros
Use Bray-Curtis for raw abundance data
Always set nstart >= 50 for K-means to avoid local optima
Cross-validate clustering results with two different methods
Avoid pure numbers as sample names (use "Plot1" instead of "1")

Follow me on Dev.to for more R tutorials.

DEV Community

Cluster Analysis in R: Hierarchical Clustering & K-means

Cluster Analysis in R: Hierarchical Clustering & K-means

Data Preparation

Hierarchical Clustering

Step 1: Calculate Distance Matrix

Step 2: UPGMA Clustering

Step 3: Plot Dendrogram

Step 4: Cut the Tree

K-means Clustering

Step 1: Dimensionality Reduction

Step 2: K-means

Step 3: Visualization

Results

Complete Runable Script

Quick Tips

Top comments (0)