UMAP: a faster map of high-dimensional data (and how it beats t-SNE)

#ai #machinelearning #python #datascience

Yesterday I built t-SNE from scratch and watched six-dimensional blobs untangle into a clean 2D picture. It works, and the clusters it draws are gorgeous. But two things nagged me. It was slow — every iteration touched all N-squared pairs of points. And it threw away the big picture: the distance between two clusters on a t-SNE plot means nothing at all. So today I built its successor, UMAP, and it fixes both problems while keeping the crisp local clusters.

The mental shift: from probabilities to a graph

t-SNE thinks in probabilities. For every pair of points it computes "how likely am I to pick you as a neighbour," builds a dense matrix of those numbers, and then nudges 2D points around until a second matrix matches the first.

UMAP thinks in topology. It assumes your data lies on some curved surface — a manifold — and it approximates that surface with a graph. Connect each point to its k nearest neighbours. Give every connection a weight between 0 and 1 that says how strongly those two points belong together. That weighted graph is the whole model. UMAP's job is then to draw the same graph in 2D so that connected points stay close and everything else drifts apart.

That reframing is where the speed comes from. A graph with k neighbours per point is sparse — you never build the giant dense matrix at all.

Building the high-dimensional graph

Three ideas do the heavy lifting, and I coded each one in plain JavaScript.

First, local connectivity. Raw neighbour distances are unfair: a point in a crowded region has close neighbours, a point in a sparse region has far ones. So for each point UMAP subtracts rho, the distance to its single nearest neighbour. This guarantees everybody is fully connected to at least one other point — nothing gets orphaned.

Second, a per-point bandwidth sigma. I binary-search sigma so the smooth membership values in each row sum to log2(n_neighbors). This adapts the graph to local density, exactly the way t-SNE tunes its Gaussian width to hit a target perplexity.

Third, symmetrise. Point i's opinion of j rarely equals j's opinion of i. UMAP fuses the two directed weights with the fuzzy union a + b - a*b. That gives one symmetric strength per edge, and the finished weighted graph is UMAP's high-dimensional target.

The two knobs that matter

n_neighbors is the local-versus-global dial. Small values (say 5) make UMAP believe only its closest points, capturing fine detail but fragmenting the overall shape. Large values (say 50) let each point see far across the manifold, preserving more global structure. It is the direct analogue of t-SNE's perplexity.

min_dist controls how tightly clusters pack. In 2D, UMAP measures similarity with the curve 1/(1 + a*d^(2b)), and it least-squares fits a and b so the curve stays flat until distance reaches min_dist, then decays. Small min_dist lets neighbours collapse almost on top of each other into dense clumps; large min_dist spreads each cluster out for a more readable plot. In my demo you can drag both sliders and watch the layout re-fit live.

Drawing it: attract, repel, decay

The layout is a force-directed graph relaxation trained by stochastic gradient descent on the cross-entropy between the high-D and low-D graphs. Each epoch:

Sample edges — a stronger edge fires more often — and pull its two endpoints together.
For each pull, do a few negative samples: grab random points and push them apart.
Decay the learning rate toward zero so the layout freezes.

Attract along real edges, repel random pairs. That is the entire optimiser, and because it works on a sparse graph with sampling instead of a dense matrix, it scales roughly linearly where t-SNE scaled quadratically per step.

I verified the interactive demo in Node with a fixed seed: a between-cluster / within-cluster separation score climbs from about 0.34 (one mixed blob) to past 25 after 500 epochs. The four true clusters genuinely snap into separate islands, and they do it fast.

What it does and does not tell you

UMAP preserves local neighbourhoods beautifully and keeps global structure better than t-SNE, so the arrangement of clusters carries some real signal. But it is still not a metric embedding — exact distances, cluster sizes, and gap sizes are only approximate, and they shift with your two knobs. Read it as "who groups together, and roughly how the groups relate," never as a ruler. As always, trust structure that survives across several parameter settings.

One more win over t-SNE: a fitted UMAP has a .transform() method, so it can project brand-new points onto an existing map. That makes it a genuine preprocessing step, not just a one-off picture — which is why it has become the default embedding plot.

Play with the live version — build the graph, drag the sliders, watch it relax: https://dev48v.infy.uk/ml/day24-umap.html