You have a pile of unlabelled data and you want to find groups in it. K-Means is the algorithm everyone reaches for first — it's fast, simple, and usually works. But it makes a bold assumption: every data point belongs to exactly one cluster. No uncertainty, no hedging.
What happens to a point sitting right between two clusters? K-Means forces a choice. Gaussian Mixture Models (GMMs) offer an alternative — soft assignments that express how uncertain we are. By the end of this post, you'll implement K-Means from scratch, see why it's secretly a special case of the EM algorithm, and understand exactly when soft clustering beats hard clustering.
Quick Win: K-Means from Scratch
Let's cluster some data. Click the badge to open the interactive notebook:
Here's K-Means in 30 lines — Lloyd's algorithm:
```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=42):
    """
    K-Means clustering (Lloyd's algorithm).

    Args:
        X: (N, D) array of data points
        K: number of clusters
        max_iter: maximum iterations
        seed: random seed for reproducibility

    Returns:
        centroids: (K, D) final cluster centres
        labels: (N,) hard cluster assignments
        history: list of (centroids, labels) at each iteration
        distortions: list of objective function values
    """
    rng = np.random.default_rng(seed)
    N, D = X.shape

    # Step 1: Initialise centroids by picking K random data points
    indices = rng.choice(N, K, replace=False)
    centroids = X[indices].copy()

    history = []
    distortions = []

    for iteration in range(max_iter):
        # Step 2: ASSIGN — each point to nearest centroid
        distances = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        labels = np.argmin(distances, axis=1)

        # Compute distortion (objective function)
        distortion = sum(np.sum((X[labels == k] - centroids[k])**2) for k in range(K))
        distortions.append(distortion)
        history.append((centroids.copy(), labels.copy()))

        # Step 3: UPDATE — move centroids to cluster means
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])

        # Check convergence
        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids

    return centroids, labels, history, distortions
```
Run it on three synthetic clusters:
```python
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
n_per_cluster = 100
cluster_1 = rng.multivariate_normal([2, 2], [[0.8, 0.4], [0.4, 0.3]], n_per_cluster)
cluster_2 = rng.multivariate_normal([7, 7], [[0.5, -0.3], [-0.3, 0.8]], n_per_cluster)
cluster_3 = rng.multivariate_normal([2, 8], [[0.3, 0.0], [0.0, 0.6]], n_per_cluster)
X = np.vstack([cluster_1, cluster_2, cluster_3])

centroids, labels, history, distortions = kmeans(X, K=3)

colours = ['#e74c3c', '#3498db', '#2ecc71']
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

ax = axes[0]
for k in range(3):
    mask = labels == k
    ax.scatter(X[mask, 0], X[mask, 1], c=colours[k], alpha=0.5, s=20, label=f'Cluster {k+1}')
ax.scatter(centroids[:, 0], centroids[:, 1], c='black', marker='X', s=200, zorder=5, label='Centroids')
ax.set_xlabel('$x_1$'); ax.set_ylabel('$x_2$')
ax.set_title('K-Means Result (K=3)')
ax.legend(); ax.grid(True, alpha=0.3)

ax = axes[1]
ax.plot(range(1, len(distortions) + 1), distortions, 'k-o', markersize=6)
ax.set_xlabel('Iteration'); ax.set_ylabel('Distortion $J$')
ax.set_title('K-Means Objective Decreases at Each Step')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
```
The result: K-Means finds three clean clusters in about 5 iterations. The distortion (sum of squared distances to centroids) drops sharply and converges. Each point is assigned to exactly one cluster — no ambiguity.
What Just Happened?
K-Means repeats two steps until nothing changes:
Step 1: Assign Points to Nearest Centroid
```python
distances = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
labels = np.argmin(distances, axis=1)
```
For each data point, compute the Euclidean distance to every centroid and pick the closest one. This creates hard responsibilities — each point gets a label of 0, 1, or 2, with no in-between.
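The broadcasting in that one-liner is worth seeing on a tiny hand-checkable case (the four points and two centroids below are made up for illustration):

```python
import numpy as np

# Four 2-D points and two centroids (illustrative values)
X = np.array([[0.0, 0.0], [1.0, 0.0], [9.0, 9.0], [10.0, 9.0]])
centroids = np.array([[0.5, 0.0], [9.5, 9.0]])

# (4, 1, 2) minus (1, 2, 2) broadcasts to (4, 2, 2);
# the norm over axis=2 leaves one distance per point-centroid pair
distances = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
print(distances.shape)  # (4, 2)

labels = np.argmin(distances, axis=1)
print(labels)  # [0 0 1 1]
```

The first two points snap to the first centroid, the last two to the second — with no record of how close the call was.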
Step 2: Move Centroids to Cluster Means
```python
new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
```
Each centroid moves to the average position of all points assigned to it. Think of it as a "centre of gravity" — the centroid is pulled toward its members.
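A quick check with made-up points and labels confirms each centroid lands on its members' average:

```python
import numpy as np

# Illustrative data: two points per cluster
X = np.array([[0.0, 0.0], [2.0, 0.0], [9.0, 8.0], [9.0, 10.0]])
labels = np.array([0, 0, 1, 1])
K = 2

# Each centroid becomes the mean of the points currently assigned to it
new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
print(new_centroids)  # [[1. 0.] [9. 9.]]
```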
Why It Converges
Each iteration can only decrease (or maintain) the distortion. Assigning a point to a closer centroid reduces the sum of squared distances. Moving a centroid to the cluster mean minimises the within-cluster variance. Since distortion is bounded below by zero and decreases monotonically, the algorithm must converge.
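We can check the monotone-decrease claim numerically. This sketch re-runs the two steps on fresh synthetic blobs (separate from the data above, so it stands alone) and asserts that the recorded distortions never increase:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two loose blobs; centroids start at random data points
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
centroids = X[rng.choice(len(X), 3, replace=False)].copy()

Js = []
for _ in range(20):
    d = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
    labels = np.argmin(d, axis=1)
    Js.append(sum(np.sum((X[labels == k] - centroids[k])**2) for k in range(3)))
    centroids = np.array([X[labels == k].mean(axis=0) for k in range(3)])

# Distortion is non-increasing (up to floating-point slack)
assert all(b <= a + 1e-9 for a, b in zip(Js, Js[1:]))
```

(Note the proof only guarantees a local optimum: different initial centroids can converge to different final distortions.)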
Let's watch this happen step by step:
The centroids start at random data points and quickly settle into the cluster centres. By iteration 3-4, the assignments are essentially fixed.
```python
from matplotlib.animation import FuncAnimation, PillowWriter

fig, ax = plt.subplots(figsize=(8, 6))

def update(frame):
    ax.clear()
    c, l = history[frame]
    for k in range(3):
        mask = l == k
        ax.scatter(X[mask, 0], X[mask, 1], c=colours[k], alpha=0.4, s=15)
    ax.scatter(c[:, 0], c[:, 1], c='black', marker='X', s=200, zorder=5)
    ax.set_title(f'K-Means: Iteration {frame + 1}')
    ax.set_xlim(X[:, 0].min() - 1, X[:, 0].max() + 1)
    ax.set_ylim(X[:, 1].min() - 1, X[:, 1].max() + 1)
    ax.grid(True, alpha=0.3)

anim = FuncAnimation(fig, update, frames=len(history), interval=600)
plt.show()
```
The Boundary Problem
K-Means makes a hard decision for every point. But what about a point sitting exactly between two centroids? It gets forced into one cluster with 100% confidence, even though a 50/50 split would be more honest.
```python
# Distance ratios reveal ambiguous points
distances = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
sorted_dists = np.sort(distances, axis=1)
ambiguity = 1 - (sorted_dists[:, 0] / sorted_dists[:, 1])

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

ax = axes[0]
for k in range(3):
    mask = labels == k
    ax.scatter(X[mask, 0], X[mask, 1], c=colours[k], alpha=0.5, s=20)
ax.scatter(centroids[:, 0], centroids[:, 1], c='black', marker='X', s=200, zorder=5)
ax.set_title('K-Means: Hard Assignments')
ax.set_xlabel('$x_1$'); ax.set_ylabel('$x_2$'); ax.grid(True, alpha=0.3)

ax = axes[1]
sc = ax.scatter(X[:, 0], X[:, 1], c=ambiguity, cmap='RdYlGn_r', s=20, alpha=0.7)
ax.scatter(centroids[:, 0], centroids[:, 1], c='black', marker='X', s=200, zorder=5)
plt.colorbar(sc, ax=ax, label='Ambiguity')
ax.set_title('How Uncertain Is Each Assignment?')
ax.set_xlabel('$x_1$'); ax.set_ylabel('$x_2$'); ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
```
The red points near the cluster boundaries are assigned with high confidence by K-Means, but they're actually ambiguous. This is the fundamental limitation of hard clustering.
Going Deeper
The Distortion Objective
K-Means minimises the distortion (also called inertia or within-cluster sum of squares):

$$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \lVert \mathbf{x}_n - \boldsymbol{\mu}_k \rVert^2$$

where $r_{nk} \in \{0, 1\}$ is the hard responsibility — 1 if point $n$ is assigned to cluster $k$, 0 otherwise. The centroid $\boldsymbol{\mu}_k$ is the mean of all points assigned to cluster $k$.
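The label-based sum our code computes and the responsibility-matrix form of $J$ give the same number; a quick sketch with random data and arbitrary centroids:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
centroids = np.array([[-1.0, 0.0], [1.0, 0.0]])
labels = np.argmin(np.linalg.norm(X[:, None] - centroids[None, :], axis=2), axis=1)

# J written with hard labels, as in the kmeans() code above
J_labels = sum(np.sum((X[labels == k] - centroids[k])**2) for k in range(2))

# J written with the binary responsibility matrix r_nk
r = np.eye(2)[labels]                                    # one-hot rows
sq = np.sum((X[:, None] - centroids[None, :])**2, axis=2)
J_resp = np.sum(r * sq)

assert np.isclose(J_labels, J_resp)
```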
K-Means Is EM with Hard Assignments
Here's the key insight: K-Means is a special case of the EM algorithm where the responsibilities are binary (0 or 1) instead of continuous probabilities.
To see this, write K-Means in EM notation:
```python
def kmeans_as_em(X, K, max_iter=100, seed=42):
    """K-Means expressed as EM with hard (0/1) responsibilities."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    indices = rng.choice(N, K, replace=False)
    centroids = X[indices].copy()

    for iteration in range(max_iter):
        # E-STEP: Compute hard responsibilities r(n,k) ∈ {0, 1}
        distances = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        r = np.zeros((N, K))
        r[np.arange(N), np.argmin(distances, axis=1)] = 1.0

        # M-STEP: Update centroids using responsibilities
        Nk = r.sum(axis=0)
        new_centroids = (r.T @ X) / Nk[:, None]

        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids

    return centroids, r
```
Compare this with the GMM E-step where $r_{nk}$ is a probability between 0 and 1. The only difference is whether the responsibility is hard or soft.
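For contrast, here is a sketch of what a soft E-step could look like (the function name and the `pis`/`means`/`covs` parameters are placeholders for this post; the full GMM fitting loop is covered in the GMM tutorial):

```python
import numpy as np

def gmm_e_step(X, pis, means, covs):
    """Soft responsibilities r_nk in [0, 1] for a Gaussian mixture (sketch)."""
    N, D = X.shape
    K = len(pis)
    log_r = np.empty((N, K))
    for k in range(K):
        diff = X - means[k]
        inv = np.linalg.inv(covs[k])
        _, logdet = np.linalg.slogdet(covs[k])
        mahal = np.einsum('nd,de,ne->n', diff, inv, diff)  # Mahalanobis distance per point
        log_r[:, k] = np.log(pis[k]) - 0.5 * (mahal + logdet + D * np.log(2 * np.pi))
    log_r -= log_r.max(axis=1, keepdims=True)   # stabilise before exponentiating
    r = np.exp(log_r)
    return r / r.sum(axis=1, keepdims=True)     # each row sums to 1

# Example: two well-separated spherical components
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(8, 1, (20, 2))])
pis = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0], [8.0, 8.0]])
covs = np.array([np.eye(2), np.eye(2)])
r = gmm_e_step(X_demo, pis, means, covs)
```

Where K-Means emits a single `argmin` index per point, each row of `r` is a probability distribution over clusters.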
As Bishop (2006) puts it: K-Means corresponds to the zero-temperature limit of the GMM. If you shrink all covariance matrices toward zero, the Gaussian components become infinitely peaked, and the soft responsibilities snap to 0 or 1.
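A one-dimensional sketch makes the limit concrete: shrink a shared variance $\epsilon$ and watch the responsibilities of a near-boundary point snap to 0/1 (the two means and the point are illustrative values):

```python
import numpy as np

mu = np.array([0.0, 2.0])   # two component means (illustrative)
x = 0.8                     # a point slightly nearer the first component

for eps in [1.0, 0.1, 0.01]:
    # Equal weights, shared variance eps: r_k ∝ exp(-(x - mu_k)^2 / (2 eps))
    log_w = -(x - mu)**2 / (2 * eps)
    r = np.exp(log_w - log_w.max())
    r /= r.sum()
    print(eps, np.round(r, 3))
# 1.0  → [0.599 0.401]   (genuinely soft)
# 0.1  → [0.982 0.018]   (sharpening)
# 0.01 → [1.    0.   ]   (effectively hard — K-Means territory)
```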
K-Means vs GMM: Side by Side
Let's fit both to the same data and compare:
```python
from matplotlib.patches import Ellipse
from sklearn.mixture import GaussianMixture

# Fit a GMM to get soft responsibilities and component parameters.
# (The EM derivation lives in the GMM tutorial; scikit-learn stands in here.)
gmm = GaussianMixture(n_components=3, random_state=42).fit(X)
gmm_resp = gmm.predict_proba(X)
gmm_means, gmm_covs = gmm.means_, gmm.covariances_

def draw_ellipse(ax, mean, cov, colour, n_std=2):
    vals, vecs = np.linalg.eigh(cov)                        # eigenvalues ascending
    angle = np.degrees(np.arctan2(vecs[1, 1], vecs[0, 1]))  # direction of the major axis
    width, height = 2 * n_std * np.sqrt(vals[::-1])         # major axis first, to match the angle
    ell = Ellipse(xy=mean, width=width, height=height, angle=angle,
                  edgecolor=colour, facecolor='none', linewidth=2, linestyle='--')
    ax.add_patch(ell)

fig, axes = plt.subplots(1, 2, figsize=(13, 5.5))

# K-Means
ax = axes[0]
for k in range(3):
    mask = labels == k
    ax.scatter(X[mask, 0], X[mask, 1], c=colours[k], alpha=0.5, s=20)
ax.scatter(centroids[:, 0], centroids[:, 1], c='black', marker='X', s=200, zorder=5)
ax.set_title('K-Means: Hard Assignments\n(each point → exactly one cluster)')
ax.set_xlabel('$x_1$'); ax.set_ylabel('$x_2$'); ax.grid(True, alpha=0.3)

# GMM
ax = axes[1]
gmm_labels = np.argmax(gmm_resp, axis=1)
for k in range(3):
    mask = gmm_labels == k
    alphas = gmm_resp[mask, k]
    ax.scatter(X[mask, 0], X[mask, 1], c=colours[k], alpha=0.3 + 0.5 * alphas, s=20)
    draw_ellipse(ax, gmm_means[k], gmm_covs[k], colours[k])
ax.scatter(gmm_means[:, 0], gmm_means[:, 1], c='black', marker='X', s=200, zorder=5)
ax.set_title('GMM: Soft Assignments\n(each point → probability per cluster)')
ax.set_xlabel('$x_1$'); ax.set_ylabel('$x_2$'); ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
```
Notice two key differences:
- Cluster shapes — K-Means produces spherical (Voronoi) boundaries. The GMM captures the elongated shape of each cluster with covariance ellipses.
- Transparency — In the GMM plot, points near boundaries are more transparent, reflecting their split responsibilities.
The Assumptions Progression
K-Means, GMMs, and Hidden Markov Models form a natural progression. Each relaxes one assumption from its predecessor:
| | K-Means | GMM | HMM |
|---|---|---|---|
| Responsibilities | Hard (0 or 1) | Soft (probabilities) | Soft (probabilities) |
| Data assumption | i.i.d. | i.i.d. | Sequentially dependent |
| Cluster shape | Spherical (isotropic) | Elliptical (full covariance) | Elliptical (full covariance) |
| Objective | Distortion $J$ | Log-likelihood $\ell$ | Log-likelihood $\ell$ |
| Fitting | Lloyd's algorithm | EM | Baum-Welch (EM for sequences) |
K-Means → GMM: relax hard to soft assignments. GMM → HMM: add sequential dependency between observations.
Choosing K: The Elbow Method
K-Means requires you to pick $K$ upfront. The elbow method plots the final distortion against $K$ and looks for the "bend":
```python
K_range = range(1, 8)
final_distortions = []
for K in K_range:
    _, _, _, dists = kmeans(X, K=K)
    final_distortions.append(dists[-1])

plt.figure(figsize=(7, 4))
plt.plot(list(K_range), final_distortions, 'ko-', markersize=8)
plt.axvline(x=3, color='red', linestyle='--', alpha=0.5, label='True K=3')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Final Distortion $J$')
plt.title('Elbow Method for Choosing K')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
```
Distortion always decreases with more clusters (at $K = N$ it's zero). The "elbow" is where adding another cluster gives diminishing returns. For our data, the bend at $K = 3$ is clear.
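One way to make "diminishing returns" concrete is to compare the fractional drop in $J$ from each extra cluster. The distortion values below are hypothetical, shaped like a typical elbow curve, purely to illustrate the rule:

```python
# Hypothetical final distortions by K (illustrative elbow shape)
final_J = {1: 5000.0, 2: 2600.0, 3: 450.0, 4: 380.0, 5: 330.0, 6: 300.0}

# Fractional improvement gained by moving from K to K+1
gains = {k: 1 - final_J[k + 1] / final_J[k] for k in range(1, 6)}

# The elbow: first K whose next cluster adds little (the 20% threshold is a judgement call)
elbow = min(k for k, g in gains.items() if g < 0.2)
print(elbow)  # 3
```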
For a more principled approach, GMMs offer the Bayesian Information Criterion (BIC), which penalises model complexity — see the GMM tutorial for details.
Hyperparameters
| Parameter | What it controls | Guidance |
|---|---|---|
| `K` | Number of clusters | Use elbow method, silhouette score, or BIC |
| `max_iter` | Maximum iterations | 100-300 usually sufficient |
| `seed` / initialisation | Starting centroids | K-Means++ (default in scikit-learn) is more robust than random |
| `n_init` | Number of random restarts | 10 is typical; keeps the best result |
When NOT to Use K-Means
- Non-spherical clusters — K-Means assumes isotropic distance. Elongated or curved clusters get split incorrectly. Use GMMs or DBSCAN.
- Unequal cluster sizes — K-Means tends to produce roughly equal-sized clusters, even when the true clusters differ dramatically in size.
- You need uncertainty — If a point's cluster membership matters (e.g., for downstream decisions), use a GMM to get probabilities.
- Sequential data — If your observations have a natural ordering (time series), the i.i.d. assumption breaks. Use HMMs instead.
When K-Means wins: It's fast ($\mathcal{O}(NKD)$ per iteration), scales to millions of points, and when clusters are roughly spherical and well-separated, it gives the same answer as the GMM with far less computation.
Deep Dive: The Papers
Lloyd (1982) and the History of K-Means
The algorithm we call "K-Means" was first described by Stuart Lloyd in a 1957 Bell Labs technical report, but wasn't formally published until 1982 as "Least Squares Quantization in PCM" in IEEE Transactions on Information Theory. Lloyd was working on pulse-code modulation — converting analogue signals to digital — and needed to find optimal quantisation levels that minimised reconstruction error.
Meanwhile, James MacQueen independently described and named the algorithm "K-Means" in a 1967 conference paper, giving it the name we still use today.
The Distortion Objective
Lloyd's algorithm minimises:

$$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \lVert \mathbf{x}_n - \boldsymbol{\mu}_k \rVert^2$$

Each iteration alternates between:

- Minimise $J$ over $r_{nk}$ (with $\boldsymbol{\mu}_k$ fixed) — this is the assignment step
- Minimise $J$ over $\boldsymbol{\mu}_k$ (with $r_{nk}$ fixed) — this is the update step

Setting the derivative of $J$ with respect to $\boldsymbol{\mu}_k$ to zero gives:

$$\boldsymbol{\mu}_k = \frac{\sum_{n} r_{nk} \, \mathbf{x}_n}{\sum_{n} r_{nk}}$$

which is just the mean of the assigned points — exactly what our code computes.
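Spelling out that derivative (a standard least-squares step):

```latex
\frac{\partial J}{\partial \boldsymbol{\mu}_k}
  = -2 \sum_{n=1}^{N} r_{nk}\,(\mathbf{x}_n - \boldsymbol{\mu}_k) = 0
\quad\Longrightarrow\quad
\boldsymbol{\mu}_k = \frac{\sum_{n} r_{nk}\,\mathbf{x}_n}{\sum_{n} r_{nk}}
```

With hard responsibilities, the denominator $\sum_n r_{nk}$ is simply the number of points in cluster $k$.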
Bishop's Unified View (PRML Chapter 9)
Bishop's Pattern Recognition and Machine Learning (2006), Chapter 9, provides the clearest treatment of the K-Means–GMM connection. He shows that K-Means arises as a limiting case of the GMM when we:
- Fix all covariance matrices to $\epsilon \mathbf{I}$ (spherical, equal variance)
- Take the limit $\epsilon \to 0$
As $\epsilon$ shrinks, the Gaussian responsibilities $r_{nk} = \pi_k \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \epsilon \mathbf{I}) \,/\, \sum_j \pi_j \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \epsilon \mathbf{I})$ become increasingly peaked, converging to 0 or 1 — exactly the hard assignments of K-Means.
"We can also derive the K-means algorithm as a particular limit of the EM algorithm for Gaussian mixtures."
— Bishop, PRML (2006), §9.3.2
Our Implementation vs the Papers
| Paper concept | Our code |
|---|---|
| Distortion $J$ | `distortion = sum(np.sum((X[labels == k] - centroids[k])**2) ...)` |
| Hard responsibilities $r_{nk}$ | `labels = np.argmin(distances, axis=1)` |
| Centroid update $\boldsymbol{\mu}_k$ | `X[labels == k].mean(axis=0)` |
| Lloyd's assign-update loop | The `for iteration` loop |
Further Reading
- Lloyd (1982) — The original "Least Squares Quantization in PCM" paper
- MacQueen (1967) — Where the name "K-Means" was coined
- Bishop's PRML, Chapter 9 — The clearest K-Means ↔ GMM connection
- Arthur & Vassilvitskii (2007) — K-Means++ initialisation
Try It Yourself
The interactive notebook includes exercises:
- Initialisation sensitivity — Run K-Means with 10 different random seeds. How often does it find the "right" clusters? Try K-Means++ initialisation.
- Non-spherical clusters — Create elongated clusters. How does K-Means handle them compared to GMM?
- Old Faithful — Apply both K-Means and GMM to the Old Faithful geyser data (2D: duration + waiting time). Compare the cluster boundaries.
- Soft K-Means — Modify the code to use soft assignments: $r_{nk} = e^{-\beta d_{nk}^2} / \sum_j e^{-\beta d_{nj}^2}$. What happens as $\beta \to \infty$?
- Compare with scikit-learn — Verify your implementation matches `sklearn.cluster.KMeans` on the same data.
Interactive Tools
- Distribution Explorer — Visualise the Gaussian and other distributions that underpin mixture models
- Classification Metrics Calculator — Evaluate clustering quality with metrics like precision and recall
Related Posts
- The EM Algorithm: An Intuitive Guide — K-Means is EM with hard assignments; the coin-toss example shows why soft assignments work better
- Gaussian Mixture Models: EM in Practice — The soft-clustering generalisation of K-Means, applied to Old Faithful geyser data
- Maximum Likelihood Estimation from Scratch — The M-step in both K-Means and GMM is a form of (weighted) MLE
Frequently Asked Questions
What is the difference between hard and soft clustering?
Hard clustering assigns each data point to exactly one cluster (as in K-Means), while soft clustering assigns a probability of belonging to each cluster (as in GMMs). Soft clustering is more informative because it captures uncertainty: points near cluster boundaries get meaningful probabilities for multiple clusters rather than being forced into one.
When should I use GMM instead of K-Means?
Use GMM when your clusters have different shapes, sizes, or orientations, since K-Means assumes spherical clusters of similar size. GMM is also better when you need membership probabilities rather than hard assignments, or when clusters overlap significantly. K-Means is faster and simpler, so use it when clusters are roughly spherical and well-separated.
How do I choose the number of components in a GMM?
Use information criteria like BIC or AIC, which balance fit quality against model complexity. Fit GMMs with different numbers of components and pick the one with the lowest BIC. You can also use silhouette scores or domain knowledge. Unlike K-Means, GMM can also use the likelihood to compare models directly.
Can GMM handle clusters of different shapes?
Yes. Each component in a GMM has its own covariance matrix, which can model elliptical clusters at any orientation. By contrast, K-Means implicitly assumes spherical clusters. You can also constrain the covariance matrices (diagonal, spherical, tied) to reduce the number of parameters when data is limited.
Why does K-Means sometimes give poor results?
K-Means fails when clusters are non-spherical, have very different sizes, or overlap heavily. It is also sensitive to initialisation (solved partially by K-Means++) and to outliers, which can pull centroids away from the true cluster centres. If K-Means gives poor results, GMM or density-based methods like DBSCAN may be more appropriate.