DEV Community

peisen yan
peisen yan

Posted on

Cluster Analysis in R: Hierarchical Clustering & K-means

Cluster Analysis in R: Hierarchical Clustering & K-means

Samples with similar species composition often reflect similar environmental conditions. Cluster analysis quantifies this similarity and helps ecologists understand community structure.

This tutorial demonstrates a complete workflow from distance calculation to hierarchical clustering and K-means clustering using simulated community data (30 plots × 20 species).

Data Preparation

The dataset contains 30 plots and 20 species, designed with 3 community types (10 plots each):

Type Plots Dominant Species Characteristic
Type A Plot01-10 Sp01-Sp08 Higher abundance of species 1-8
Type B Plot11-20 Sp09-Sp15 Higher abundance of species 9-15
Type C Plot21-30 Sp16-Sp20 Higher abundance of species 16-20


# Read data
df <- read.csv("community_data.csv", header = TRUE, row.names = 1)
head(df[, 1:6])
Enter fullscreen mode Exit fullscreen mode

Hierarchical Clustering

Step 1: Calculate Distance Matrix

For species abundance data, Jaccard distance or Bray-Curtis distance are most commonly used.

library(vegan)
dist_jac <- vegdist(df, method = "jaccard")
Enter fullscreen mode Exit fullscreen mode

Step 2: UPGMA Clustering

hc_upgma <- hclust(dist_jac, method = "average")
Enter fullscreen mode Exit fullscreen mode

method = "average" is UPGMA (Unweighted Pair Group Method with Arithmetic Mean). Other options:

Method Description
"average" UPGMA, most common, balanced branches
"ward.D2" Ward's minimum variance, compact clusters
"complete" Complete linkage, based on farthest points
"single" Single linkage, based on nearest points

Step 3: Plot Dendrogram

# Vertical dendrogram
par(mar = c(4, 4, 3, 8))
plot(hc_upgma, hang = -1, main = "Hierarchical Clustering (UPGMA, Jaccard)",
     xlab = "Sample", ylab = "Jaccard Distance", sub = "", cex = 0.7)
rect.hclust(hc_upgma, k = 3, border = c("#E64B35", "#4DBBD5", "#00A087"))
Enter fullscreen mode Exit fullscreen mode

Parameter notes:

  • hang = -1: Aligns all branch tips to the same horizontal line
  • rect.hclust(k = 3): Draws rectangles around 3 clusters
  • cex = 0.7: Adjusts label font size


# Horizontal dendrogram (better for many samples)
plot(as.dendrogram(hc_upgma), horiz = TRUE,
     main = "Hierarchical Clustering (UPGMA, Jaccard)")
Enter fullscreen mode Exit fullscreen mode

Step 4: Cut the Tree

hc_clusters <- cutree(hc_upgma, k = 3)
print(table(hc_clusters))
Enter fullscreen mode Exit fullscreen mode

cutree() cuts the dendrogram at a specified height and returns cluster membership.

K-means Clustering

Hierarchical clustering doesn't require pre-specifying the number of clusters, but when you have a prior estimate, K-means is faster.

Step 1: Dimensionality Reduction

K-means performs poorly on high-dimensional species data. Reduce dimensions first:

dist_bc <- vegdist(df, method = "bray")
pcoa <- cmdscale(dist_bc, k = 4, eig = TRUE)
Enter fullscreen mode Exit fullscreen mode

Step 2: K-means

set.seed(123)
km <- kmeans(pcoa$points, centers = 3, nstart = 50)
km$cluster
Enter fullscreen mode Exit fullscreen mode

Parameters:

  • centers = 3: Number of clusters
  • nstart = 50: 50 random initializations, takes the best result
  • set.seed(123): Ensures reproducibility

Step 3: Visualization

library(factoextra)
fviz_cluster(km, data = pcoa$points,
             ellipse.type = "norm",
             palette = c("#E64B35", "#4DBBD5", "#00A087"),
             ggtheme = theme_bw(),
             main = "K-means Clustering (k = 3)",
             labelsize = 0) +
  labs(color = "Cluster", shape = "Cluster")
Enter fullscreen mode Exit fullscreen mode

fviz_cluster() projects data onto the first two principal coordinates, with ellipse.type = "norm" adding 95% confidence ellipses.

Results

The dendrogram and K-means plot clearly show:

  1. Three distinct community types — The three main branches in the dendrogram correspond to the three simulated community types.

  2. UPGMA and K-means agree — Both methods produce highly consistent classification results, approaching the problem from different angles.

  3. Applications:

    • Classifying community types based on species composition
    • Validating predefined habitat/treatment groupings
    • Providing categorical variables for downstream ordination (NMDS, RDA)

Complete Runable Script

# ============================================================
# Complete workflow: Hierarchical + K-means clustering
# ============================================================
# First run: install.packages(c("vegan", "factoextra"))

library(vegan)
library(factoextra)

df <- read.csv("community_data.csv", header = TRUE, row.names = 1)

# Hierarchical clustering
dist_jac <- vegdist(df, method = "jaccard")
hc_upgma <- hclust(dist_jac, method = "average")

pdf("dendrogram.pdf", width = 12, height = 7)
par(mar = c(4, 4, 3, 8))
plot(hc_upgma, hang = -1, main = "Hierarchical Clustering (UPGMA, Jaccard)",
     xlab = "Sample", ylab = "Jaccard Distance", sub = "", cex = 0.7)
rect.hclust(hc_upgma, k = 3, border = c("#E64B35", "#4DBBD5", "#00A087"))
dev.off()

pdf("dendrogram_horizontal.pdf", width = 8, height = 10)
par(mar = c(4, 4, 2, 2))
plot(as.dendrogram(hc_upgma), horiz = TRUE,
     main = "Hierarchical Clustering (UPGMA, Jaccard)")
dev.off()

hc_clusters <- cutree(hc_upgma, k = 3)

# K-means
dist_bc <- vegdist(df, method = "bray")
pcoa <- cmdscale(dist_bc, k = 4, eig = TRUE)
set.seed(123)
km <- kmeans(pcoa$points, centers = 3, nstart = 50)

pdf("kmeans_clusters.pdf", width = 8, height = 6)
fviz_cluster(km, data = pcoa$points,
             ellipse.type = "norm",
             palette = c("#E64B35", "#4DBBD5", "#00A087"),
             ggtheme = theme_bw(),
             main = "K-means Clustering (k = 3)",
             labelsize = 0) +
  labs(color = "Cluster", shape = "Cluster")
dev.off()

# Compare results
data.frame(Sample = rownames(df),
           Hierarchical = paste0("G", hc_clusters),
           Kmeans = paste0("G", km$cluster))
Enter fullscreen mode Exit fullscreen mode

To use with your own data:

  1. Prepare a CSV file (rows = samples, columns = species, values = abundance)
  2. Replace community_data.csv with your file name
  3. Adjust k = 3 to your expected number of clusters

Quick Tips

  • Use Jaccard distance for presence/absence or abundance data with many zeros
  • Use Bray-Curtis for raw abundance data
  • Always set nstart >= 50 for K-means to avoid local optima
  • Cross-validate clustering results with two different methods
  • Avoid pure numbers as sample names (use "Plot1" instead of "1")

Follow me on Dev.to for more R tutorials.

Top comments (0)