k-means from scratch, and why it sometimes lies

#machinelearning #clustering #r #kmeans

k-means is usually the first clustering algorithm anyone learns: hand it data and a number k, and it splits the data into k groups. It's two simple steps repeated until they settle, and you can write it from scratch in a few lines of R. But it has two honest failure modes that the tutorials skip, and knowing them is the difference between using it well and being fooled by it.

The one idea: assign, then update, repeat

k-means alternates two steps until nothing changes:

Assign: put each point in the cluster whose center (centroid) is nearest.
Update: move each centroid to the mean of the points now assigned to it.

That's it. Assign points to the nearest center; move each center to the middle of its points; repeat. Each round can only reduce the total within-cluster distance, so it converges. Here it is in R:

kmeans_scratch <- function(X, k, iters = 20) {
  centroids <- X[sample(nrow(X), k), , drop = FALSE]   # start: k random points
  cluster <- rep(0, nrow(X))

  for (it in seq_len(iters)) {
    # --- assign: nearest centroid for each point ---
    for (i in seq_len(nrow(X))) {
      dists <- apply(centroids, 1, function(c) sum((X[i, ] - c)^2))
      cluster[i] <- which.min(dists)
    }
    # --- update: each centroid becomes the mean of its points ---
    for (j in seq_len(k)) {
      if (any(cluster == j)) centroids[j, ] <- colMeans(X[cluster == j, , drop = FALSE])
    }
  }
  list(cluster = cluster, centroids = centroids)
}

set.seed(1)
X <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))   # two real blobs
res <- kmeans_scratch(X, k = 2)
res$centroids        # ~ (0,0) and (5,5)

On two well-separated blobs it nails them: the centroids land near the true centers. Two details that matter:

We use squared distance. sum((X[i,] - c)^2) is squared Euclidean distance. We skip the square root because we only need to know which centroid is nearest, and sqrt doesn't change the ordering, a small, free speedup.
The update is just colMeans. "Move the centroid to the middle of its points" is literally the mean of each coordinate. That's why it's called k-means.

Lie #1: the answer depends on where you start

The centroids start at random points. Run k-means twice with different random starts and you can get different clusterings, because the assign/update loop only finds a local optimum, the nearest stable configuration to wherever it began, not necessarily the best one. On easy data it doesn't matter; on harder data, one start finds clean clusters and another splits a real group in half.

The standard fix is nstart: run it many times from different random starts and keep the result with the lowest total within-cluster distance.

# real R: try 25 starts, keep the best
km <- kmeans(X, centers = 2, nstart = 25)

If you ever rely on k-means with a single start, you're trusting a coin flip. Always use multiple starts.

Lie #2: it always finds k clusters, even when there are none

This is the dangerous one. k-means cannot say "there are no clusters." Ask for k = 4 and it will partition your data into 4 groups, even if the data is one uniform cloud with no structure at all. It returns confident-looking clusters that mean nothing.

Two guards:

Choose k honestly. Don't assume it. Use the elbow method (plot total within-cluster distance as k increases, look for the bend where adding clusters stops helping) or the silhouette score. If there's no elbow, that's a hint there are no natural clusters.
Know the shape assumption. k-means assumes clusters are roughly round and similar-sized, because it judges everything by distance to a center. Give it two long crescents, or one big and one tiny group, and it carves them wrong. For non-round structure, density-based methods (DBSCAN) fit better.

Why this is worth building

Writing k-means yourself makes both lies obvious instead of surprising. You see that the centroids start random (so results vary), and that the loop must assign every point to some cluster (so it can't refuse to find groups). Those aren't bugs, they're direct consequences of the two-step algorithm, visible the moment you've coded it.

That's the real value of from-scratch implementation: you learn not just how an algorithm works, but how it fails, which is what separates someone who runs kmeans() from someone who can trust, or distrust, its output. Building the classics this way, and meeting their honest limits, is the whole approach of the machine learning in R track.

Top comments (2)

Vic Chen • Jun 30

Nice breakdown. I especially liked the way you highlighted the two failure modes people usually skip: random initialization leading to different local optima, and the mismatch between k-means assumptions and non-spherical structure. As someone building AI/data products, I think that kind of intuition matters more than just calling a library function. Solid post.

Vic Chen • Jul 1

This is a clean explanation of why k-means is useful and where it can quietly mislead people. I especially liked the framing around “honest failure modes” — that’s the part many beginner ML tutorials skip. Writing it from scratch also makes the assign/update loop much more intuitive than jumping straight into a library call. Posts like this are great for building real intuition before people move on to more complex clustering methods.