DEV Community

I Want To Learn Programming
I Want To Learn Programming

Posted on • Originally published at iwtlp.com

k-means from scratch, and why it sometimes lies

k-means is usually the first clustering algorithm anyone learns: hand it data and a number k, and it splits the data into k groups. It's two simple steps repeated until they settle, and you can write it from scratch in a few lines of R. But it has two honest failure modes that the tutorials skip, and knowing them is the difference between using it well and being fooled by it.

The one idea: assign, then update, repeat

k-means alternates two steps until nothing changes:

  1. Assign: put each point in the cluster whose center (centroid) is nearest.
  2. Update: move each centroid to the mean of the points now assigned to it.

That's it. Assign points to the nearest center; move each center to the middle of its points; repeat. Each round can only reduce the total within-cluster distance, so it converges. Here it is in R:

kmeans_scratch <- function(X, k, iters = 20) {
  centroids <- X[sample(nrow(X), k), , drop = FALSE]   # start: k random points
  cluster <- rep(0, nrow(X))

  for (it in seq_len(iters)) {
    # --- assign: nearest centroid for each point ---
    for (i in seq_len(nrow(X))) {
      dists <- apply(centroids, 1, function(c) sum((X[i, ] - c)^2))
      cluster[i] <- which.min(dists)
    }
    # --- update: each centroid becomes the mean of its points ---
    for (j in seq_len(k)) {
      if (any(cluster == j)) centroids[j, ] <- colMeans(X[cluster == j, , drop = FALSE])
    }
  }
  list(cluster = cluster, centroids = centroids)
}

set.seed(1)
X <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))   # two real blobs
res <- kmeans_scratch(X, k = 2)
res$centroids        # ~ (0,0) and (5,5)
Enter fullscreen mode Exit fullscreen mode

On two well-separated blobs it nails them: the centroids land near the true centers. Two details that matter:

  • We use squared distance. sum((X[i,] - c)^2) is squared Euclidean distance. We skip the square root because we only need to know which centroid is nearest, and sqrt doesn't change the ordering, a small, free speedup.
  • The update is just colMeans. "Move the centroid to the middle of its points" is literally the mean of each coordinate. That's why it's called k-means.

Lie #1: the answer depends on where you start

The centroids start at random points. Run k-means twice with different random starts and you can get different clusterings, because the assign/update loop only finds a local optimum, the nearest stable configuration to wherever it began, not necessarily the best one. On easy data it doesn't matter; on harder data, one start finds clean clusters and another splits a real group in half.

The standard fix is nstart: run it many times from different random starts and keep the result with the lowest total within-cluster distance.

# real R: try 25 starts, keep the best
km <- kmeans(X, centers = 2, nstart = 25)
Enter fullscreen mode Exit fullscreen mode

If you ever rely on k-means with a single start, you're trusting a coin flip. Always use multiple starts.

Lie #2: it always finds k clusters, even when there are none

This is the dangerous one. k-means cannot say "there are no clusters." Ask for k = 4 and it will partition your data into 4 groups, even if the data is one uniform cloud with no structure at all. It returns confident-looking clusters that mean nothing.

Two guards:

  • Choose k honestly. Don't assume it. Use the elbow method (plot total within-cluster distance as k increases, look for the bend where adding clusters stops helping) or the silhouette score. If there's no elbow, that's a hint there are no natural clusters.
  • Know the shape assumption. k-means assumes clusters are roughly round and similar-sized, because it judges everything by distance to a center. Give it two long crescents, or one big and one tiny group, and it carves them wrong. For non-round structure, density-based methods (DBSCAN) fit better.

Why this is worth building

Writing k-means yourself makes both lies obvious instead of surprising. You see that the centroids start random (so results vary), and that the loop must assign every point to some cluster (so it can't refuse to find groups). Those aren't bugs, they're direct consequences of the two-step algorithm, visible the moment you've coded it.

That's the real value of from-scratch implementation: you learn not just how an algorithm works, but how it fails, which is what separates someone who runs kmeans() from someone who can trust, or distrust, its output. Building the classics this way, and meeting their honest limits, is the whole approach of the machine learning in R track.

Top comments (0)