Hierarchical clustering: merge nearest, read the dendrogram

#machinelearning #datascience #python #ai

K-Means has one annoying question it asks before you've even looked at your data: how many clusters? Pick k=3 and it will happily carve the world into three, whether or not three is the truth. Hierarchical clustering flips that around. Instead of committing to a number, it builds the entire family of clusterings at once and lets you choose afterward, by reading a picture.

I built a from-scratch, runs-in-your-browser version here:

🌳 https://dev48v.infy.uk/ml/day25-hierarchical-clustering.html

The idea: keep merging the two closest

The common flavour is agglomerative (bottom-up), and the loop is almost embarrassingly simple:

Start with every point as its own cluster. With 24 points, that's 24 clusters.
Find the two closest clusters and fuse them into one. Write down the distance at which they joined.
Repeat until a single cluster remains.

That's n−1 merges, and the list of "who merged, and at what height" is the whole model. There's a top-down cousin, divisive, that starts with one big cluster and recursively splits it, but it's much more expensive, so almost nobody uses it.

Linkage: what does "closest clusters" even mean?

Distance between two points is obvious. Distance between two clusters is a choice, and it changes everything:

single — the distance between the two closest members. Great for long, snaky, non-convex shapes, but suffers "chaining": one stray bridge of points can glue two separate blobs into one straggly mess.
complete — the distance between the two farthest members. Produces tight, compact, evenly sized clusters and resists chaining.
average — the mean over all cross-cluster pairs. A sensible middle ground, popular in biology (UPGMA).
Ward — merge the pair that increases the total within-cluster variance the least. Usually the cleanest result for roughly round blobs, and the default I'd reach for first.

Same data, different linkage, genuinely different trees. The demo lets you flip between all four and watch the dendrogram rearrange.

The dendrogram: a tree you can actually read

Every merge becomes a horizontal bar, and the bar's height is the distance at which those two groups joined. Low bars mean "these were very similar." Tall bars mean "we just forced two dissimilar groups together." So the tall vertical gaps in the tree are exactly where the natural cluster boundaries live.

Cutting the tree

Here's the payoff. A dendrogram secretly contains every possible number of clusters. You pick one by drawing a horizontal cut line: everything below the cut collapses into a flat cluster, everything above stays separate. Cut low and you get many tiny clusters; cut high and they merge into a few big ones.

In the demo the scatter plot and the dendrogram are linked — drag the cut line and the points instantly recolour to match the flat clusters at that height. Slide it down to the biggest gap and three blobs snap out cleanly. That's the whole trick: choose k after seeing the structure, not before.

Where should you cut? Three honest answers: cut at the tallest vertical gap (the biggest jump between consecutive merge heights), cut to a target number of clusters if you already know it, or score a few candidate cuts with the silhouette coefficient and keep the best.

Why bother vs K-Means?

No k up front — you get the full nesting and decide later.
Deterministic. No random restarts, same answer every run.
With single linkage it can trace shapes (chains, rings) that K-Means can never separate.

The catch is cost: it's O(n²) memory for the distance matrix and up to O(n³) time for the naive merge loop. That's fine for hundreds to a few thousand points and impractical for millions — for those, fall back to K-Means, DBSCAN, or BIRCH, or run hierarchical on a sample.

In real code

You'd never hand-roll it in production. SciPy does the whole thing:

from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

Z = linkage(X, method="ward")          # the full merge tree
dendrogram(Z)                          # draw it
labels = fcluster(Z, t=3, criterion="maxclust")   # cut into 3 clusters

The page walks it in three tabs: LOOK (the live scatter + dendrogram with the cut slider), UNDERSTAND (ten steps from the merge loop to the four linkages to complexity), and BUILD (the ~70 lines of vanilla JS behind it, then the SciPy version).

Have a play with the linkage selector and the cut line — single vs Ward on the same points is the fastest way to feel why linkage matters:

🌳 https://dev48v.infy.uk/ml/day25-hierarchical-clustering.html

Part of MachineLearningFromZero. 🌐 https://dev48v.infy.uk