You have a pile of unlabelled data and you want to find groups in it. K-Means is the algorithm everyone reaches for first — it's fast, simple, and usually works. But it makes a bold assumption: every data point belongs to exactly one cluster. No uncertainty, no hedging.
What happens to a point sitting right between two clusters? K-Means forces a choice. Gaussian Mixture Models (GMMs) offer an alternative — soft assignments that express how uncertain we are. By the end of this post, you'll implement K-Means from scratch, see why it's secretly a special case of the EM algorithm, and understand exactly when soft clustering beats hard clustering.
Quick Win: K-Means from Scratch
Let's cluster some data. Click the badge to open the interactive notebook:
Here's K-Means in 30 lines — Lloyd's algorithm:
```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=42):
    """
    K-Means clustering (Lloyd's algorithm).

    Args:
        X: (N, D) array of data points
        K: number of clusters
        max_iter: maximum iterations
        seed: random seed for reproducibility

    Returns:
        centroids: (K, D) final cluster centres
        labels: (N,) hard cluster assignments
        history: list of (centroids, labels) at each iteration
        distortions: list of objective function values
    """
    rng = np.random.default_rng(seed)
    N, D = X.shape

    # Step 1: Initialise centroids by picking K random data points
    indices = rng.choice(N, K, replace=False)
    centroids = X[indices].copy()

    history = []
    distortions = []

    for iteration in range(max_iter):
        # Step 2: ASSIGN — each point to nearest centroid
        distances = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        labels = np.argmin(distances, axis=1)

        # Compute distortion (objective function)
        distortion = sum(np.sum((X[labels == k] - centroids[k])**2) for k in range(K))
        distortions.append(distortion)
        history.append((centroids.copy(), labels.copy()))

        # Step 3: UPDATE — move centroids to cluster means
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])

        # Check convergence
        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids

    return centroids, labels, history, distortions
```
Run it on three synthetic clusters:
```python
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
n_per_cluster = 100
cluster_1 = rng.multivariate_normal([2, 2], [[0.8, 0.4], [0.4, 0.3]], n_per_cluster)
cluster_2 = rng.multivariate_normal([7, 7], [[0.5, -0.3], [-0.3, 0.8]], n_per_cluster)
cluster_3 = rng.multivariate_normal([2, 8], [[0.3, 0.0], [0.0, 0.6]], n_per_cluster)
X = np.vstack([cluster_1, cluster_2, cluster_3])

centroids, labels, history, distortions = kmeans(X, K=3)

colours = ['#e74c3c', '#3498db', '#2ecc71']
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

ax = axes[0]
for k in range(3):
    mask = labels == k
    ax.scatter(X[mask, 0], X[mask, 1], c=colours[k], alpha=0.5, s=20, label=f'Cluster {k+1}')
ax.scatter(centroids[:, 0], centroids[:, 1], c='black', marker='X', s=200, zorder=5, label='Centroids')
ax.set_xlabel('$x_1$'); ax.set_ylabel('$x_2$')
ax.set_title('K-Means Result (K=3)')
ax.legend(); ax.grid(True, alpha=0.3)

ax = axes[1]
ax.plot(range(1, len(distortions) + 1), distortions, 'k-o', markersize=6)
ax.set_xlabel('Iteration'); ax.set_ylabel('Distortion $J$')
ax.set_title('K-Means Objective Decreases at Each Step')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
```
The result: K-Means finds three clean clusters in about 5 iterations. The distortion (sum of squared distances to centroids) drops sharply and converges. Each point is assigned to exactly one cluster — no ambiguity.
What Just Happened?
K-Means repeats two steps until nothing changes:
Step 1: Assign Points to Nearest Centroid
```python
distances = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
labels = np.argmin(distances, axis=1)
```
For each data point, compute the Euclidean distance to every centroid and pick the closest one. This creates hard responsibilities — each point gets a label of 0, 1, or 2, with no in-between.
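The broadcasting in that one-liner is worth seeing on a tiny hand-checkable case (the four points and two centroids below are made up for illustration):

```python
import numpy as np

# Four 2-D points and two centroids (illustrative values)
X = np.array([[0.0, 0.0], [1.0, 0.0], [9.0, 9.0], [10.0, 9.0]])
centroids = np.array([[0.5, 0.0], [9.5, 9.0]])

# (4, 1, 2) minus (1, 2, 2) broadcasts to (4, 2, 2);
# the norm over axis=2 leaves one distance per point-centroid pair
distances = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
print(distances.shape)  # (4, 2)

labels = np.argmin(distances, axis=1)
print(labels)  # [0 0 1 1]
```

The first two points snap to the first centroid, the last two to the second — with no record of how close the call was.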
Step 2: Move Centroids to Cluster Means
```python
new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
```
Each centroid moves to the average position of all points assigned to it. Think of it as a "centre of gravity" — the centroid is pulled toward its members.
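A quick check with made-up points and labels confirms each centroid lands on its members' average:

```python
import numpy as np

# Illustrative data: two points per cluster
X = np.array([[0.0, 0.0], [2.0, 0.0], [9.0, 8.0], [9.0, 10.0]])
labels = np.array([0, 0, 1, 1])
K = 2

# Each centroid becomes the mean of the points currently assigned to it
new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
print(new_centroids)  # [[1. 0.] [9. 9.]]
```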
Why It Converges
Each iteration can only decrease (or maintain) the distortion. Assigning a point to a closer centroid reduces the sum of squared distances. Moving a centroid to the cluster mean minimises the within-cluster variance. Since distortion is bounded below by zero and decreases monotonically, the algorithm must converge.
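We can check the monotone-decrease claim numerically. This sketch re-runs the two steps on fresh synthetic blobs (separate from the data above, so it stands alone) and asserts that the recorded distortions never increase:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two loose blobs; centroids start at random data points
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
centroids = X[rng.choice(len(X), 3, replace=False)].copy()

Js = []
for _ in range(20):
    d = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
    labels = np.argmin(d, axis=1)
    Js.append(sum(np.sum((X[labels == k] - centroids[k])**2) for k in range(3)))
    centroids = np.array([X[labels == k].mean(axis=0) for k in range(3)])

# Distortion is non-increasing (up to floating-point slack)
assert all(b <= a + 1e-9 for a, b in zip(Js, Js[1:]))
```

(Note the proof only guarantees a local optimum: different initial centroids can converge to different final distortions.)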
Let's watch this happen step by step:
The centroids start at random data points and quickly settle into the cluster centres. By iteration 3-4, the assignments are essentially fixed.
```python
from matplotlib.animation import FuncAnimation, PillowWriter

fig, ax = plt.subplots(figsize=(8, 6))

def update(frame):
    ax.clear()
    c, l = history[frame]
    for k in range(3):
        mask = l == k
        ax.scatter(X[mask, 0], X[mask, 1], c=colours[k], alpha=0.4, s=15)
    ax.scatter(c[:, 0], c[:, 1], c='black', marker='X', s=200, zorder=5)
    ax.set_title(f'K-Means: Iteration {frame + 1}')
    ax.set_xlim(X[:, 0].min() - 1, X[:, 0].max() + 1)
    ax.set_ylim(X[:, 1].min() - 1, X[:, 1].max() + 1)
    ax.grid(True, alpha=0.3)

anim = FuncAnimation(fig, update, frames=len(history), interval=600)
plt.show()
```
The Boundary Problem
K-Means makes a hard decision for every point. But what about a point sitting exactly between two centroids? It gets forced into one cluster with 100% confidence, even though a 50/50 split would be more honest.
```python
# Distance ratios reveal ambiguous points
distances = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
sorted_dists = np.sort(distances, axis=1)
ambiguity = 1 - (sorted_dists[:, 0] / sorted_dists[:, 1])

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

ax = axes[0]
for k in range(3):
    mask = labels == k
    ax.scatter(X[mask, 0], X[mask, 1], c=colours[k], alpha=0.5, s=20)
ax.scatter(centroids[:, 0], centroids[:, 1], c='black', marker='X', s=200, zorder=5)
ax.set_title('K-Means: Hard Assignments')
ax.set_xlabel('$x_1$'); ax.set_ylabel('$x_2$'); ax.grid(True, alpha=0.3)

ax = axes[1]
sc = ax.scatter(X[:, 0], X[:, 1], c=ambiguity, cmap='RdYlGn_r', s=20, alpha=0.7)
ax.scatter(centroids[:, 0], centroids[:, 1], c='black', marker='X', s=200, zorder=5)
plt.colorbar(sc, ax=ax, label='Ambiguity')
ax.set_title('How Uncertain Is Each Assignment?')
ax.set_xlabel('$x_1$'); ax.set_ylabel('$x_2$'); ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
```
The red points near the cluster boundaries are assigned with high confidence by K-Means, but they're actually ambiguous. This is the fundamental limitation of hard clustering.
Going Deeper
The Distortion Objective
K-Means minimises the distortion (also called inertia or within-cluster sum of squares):

$$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \lVert \mathbf{x}_n - \boldsymbol{\mu}_k \rVert^2$$

where $r_{nk} \in \{0, 1\}$ is the hard responsibility — 1 if point $n$ is assigned to cluster $k$, 0 otherwise. The centroid $\boldsymbol{\mu}_k$ is the mean of all points assigned to cluster $k$.
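The label-based sum our code computes and the responsibility-matrix form of $J$ give the same number; a quick sketch with random data and arbitrary centroids:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
centroids = np.array([[-1.0, 0.0], [1.0, 0.0]])
labels = np.argmin(np.linalg.norm(X[:, None] - centroids[None, :], axis=2), axis=1)

# J written with hard labels, as in the kmeans() code above
J_labels = sum(np.sum((X[labels == k] - centroids[k])**2) for k in range(2))

# J written with the binary responsibility matrix r_nk
r = np.eye(2)[labels]                                    # one-hot rows
sq = np.sum((X[:, None] - centroids[None, :])**2, axis=2)
J_resp = np.sum(r * sq)

assert np.isclose(J_labels, J_resp)
```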
K-Means Is EM with Hard Assignments
Here's the key insight: K-Means is a special case of the EM algorithm where the responsibilities are binary (0 or 1) instead of continuous probabilities.
To see this, write K-Means in EM notation:
```python
def kmeans_as_em(X, K, max_iter=100, seed=42):
    """K-Means expressed as EM with hard (0/1) responsibilities."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    indices = rng.choice(N, K, replace=False)
    centroids = X[indices].copy()

    for iteration in range(max_iter):
        # E-STEP: Compute hard responsibilities r(n,k) ∈ {0, 1}
        distances = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        r = np.zeros((N, K))
        r[np.arange(N), np.argmin(distances, axis=1)] = 1.0

        # M-STEP: Update centroids using responsibilities
        Nk = r.sum(axis=0)
        new_centroids = (r.T @ X) / Nk[:, None]

        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids

    return centroids, r
```
Compare this with the GMM E-step where $r_{nk}$ is a probability between 0 and 1. The only difference is whether the responsibility is hard or soft.
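For contrast, here is a sketch of what a soft E-step could look like (the function name and the `pis`/`means`/`covs` parameters are placeholders for this post; the full GMM fitting loop is covered in the GMM tutorial):

```python
import numpy as np

def gmm_e_step(X, pis, means, covs):
    """Soft responsibilities r_nk in [0, 1] for a Gaussian mixture (sketch)."""
    N, D = X.shape
    K = len(pis)
    log_r = np.empty((N, K))
    for k in range(K):
        diff = X - means[k]
        inv = np.linalg.inv(covs[k])
        _, logdet = np.linalg.slogdet(covs[k])
        mahal = np.einsum('nd,de,ne->n', diff, inv, diff)  # Mahalanobis distance per point
        log_r[:, k] = np.log(pis[k]) - 0.5 * (mahal + logdet + D * np.log(2 * np.pi))
    log_r -= log_r.max(axis=1, keepdims=True)   # stabilise before exponentiating
    r = np.exp(log_r)
    return r / r.sum(axis=1, keepdims=True)     # each row sums to 1

# Example: two well-separated spherical components
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(8, 1, (20, 2))])
pis = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0], [8.0, 8.0]])
covs = np.array([np.eye(2), np.eye(2)])
r = gmm_e_step(X_demo, pis, means, covs)
```

Where K-Means emits a single `argmin` index per point, each row of `r` is a probability distribution over clusters.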
As Bishop (2006) puts it: K-Means corresponds to the zero-temperature limit of the GMM. If you shrink all covariance matrices toward zero, the Gaussian components become infinitely peaked, and the soft responsibilities snap to 0 or 1.
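A one-dimensional sketch makes the limit concrete: shrink a shared variance $\epsilon$ and watch the responsibilities of a near-boundary point snap to 0/1 (the two means and the point are illustrative values):

```python
import numpy as np

mu = np.array([0.0, 2.0])   # two component means (illustrative)
x = 0.8                     # a point slightly nearer the first component

for eps in [1.0, 0.1, 0.01]:
    # Equal weights, shared variance eps: r_k ∝ exp(-(x - mu_k)^2 / (2 eps))
    log_w = -(x - mu)**2 / (2 * eps)
    r = np.exp(log_w - log_w.max())
    r /= r.sum()
    print(eps, np.round(r, 3))
# 1.0  → [0.599 0.401]   (genuinely soft)
# 0.1  → [0.982 0.018]   (sharpening)
# 0.01 → [1.    0.   ]   (effectively hard — K-Means territory)
```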
K-Means vs GMM: Side by Side
Let's fit both to the same data and compare:
```python
from matplotlib.patches import Ellipse
from sklearn.mixture import GaussianMixture

# Fit a GMM to get soft responsibilities and component parameters.
# (The EM derivation lives in the GMM tutorial; scikit-learn stands in here.)
gmm = GaussianMixture(n_components=3, random_state=42).fit(X)
gmm_resp = gmm.predict_proba(X)
gmm_means, gmm_covs = gmm.means_, gmm.covariances_

def draw_ellipse(ax, mean, cov, colour, n_std=2):
    vals, vecs = np.linalg.eigh(cov)                        # eigenvalues ascending
    angle = np.degrees(np.arctan2(vecs[1, 1], vecs[0, 1]))  # direction of the major axis
    width, height = 2 * n_std * np.sqrt(vals[::-1])         # major axis first, to match the angle
    ell = Ellipse(xy=mean, width=width, height=height, angle=angle,
                  edgecolor=colour, facecolor='none', linewidth=2, linestyle='--')
    ax.add_patch(ell)

fig, axes = plt.subplots(1, 2, figsize=(13, 5.5))

# K-Means
ax = axes[0]
for k in range(3):
    mask = labels == k
    ax.scatter(X[mask, 0], X[mask, 1], c=colours[k], alpha=0.5, s=20)
ax.scatter(centroids[:, 0], centroids[:, 1], c='black', marker='X', s=200, zorder=5)
ax.set_title('K-Means: Hard Assignments\n(each point → exactly one cluster)')
ax.set_xlabel('$x_1$'); ax.set_ylabel('$x_2$'); ax.grid(True, alpha=0.3)

# GMM
ax = axes[1]
gmm_labels = np.argmax(gmm_resp, axis=1)
for k in range(3):
    mask = gmm_labels == k
    alphas = gmm_resp[mask, k]
    ax.scatter(X[mask, 0], X[mask, 1], c=colours[k], alpha=0.3 + 0.5 * alphas, s=20)
    draw_ellipse(ax, gmm_means[k], gmm_covs[k], colours[k])
ax.scatter(gmm_means[:, 0], gmm_means[:, 1], c='black', marker='X', s=200, zorder=5)
ax.set_title('GMM: Soft Assignments\n(each point → probability per cluster)')
ax.set_xlabel('$x_1$'); ax.set_ylabel('$x_2$'); ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
```
Notice two key differences:
- Cluster shapes — K-Means produces spherical (Voronoi) boundaries. The GMM captures the elongated shape of each cluster with covariance ellipses.
- Transparency — In the GMM plot, points near boundaries are more transparent, reflecting their split responsibilities.
The Assumptions Progression
K-Means, GMMs, and Hidden Markov Models form a natural progression. Each relaxes one assumption from its predecessor:
| | K-Means | GMM | HMM |
|---|---|---|---|
| Responsibilities | Hard (0 or 1) | Soft (probabilities) | Soft (probabilities) |
| Data assumption | i.i.d. | i.i.d. | Sequentially dependent |
| Cluster shape | Spherical (isotropic) | Elliptical (full covariance) | Elliptical (full covariance) |
| Objective | Distortion $J$ | Log-likelihood $\ell$ | Log-likelihood $\ell$ |
| Fitting | Lloyd's algorithm | EM | Baum-Welch (EM for sequences) |
K-Means → GMM: relax hard to soft assignments. GMM → HMM: add sequential dependency between observations.
Choosing K: The Elbow Method
K-Means requires you to pick $K$ upfront. The elbow method plots the final distortion against $K$ and looks for the "bend":
```python
K_range = range(1, 8)
final_distortions = []
for K in K_range:
    _, _, _, dists = kmeans(X, K=K)
    final_distortions.append(dists[-1])

plt.figure(figsize=(7, 4))
plt.plot(list(K_range), final_distortions, 'ko-', markersize=8)
plt.axvline(x=3, color='red', linestyle='--', alpha=0.5, label='True K=3')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Final Distortion $J$')
plt.title('Elbow Method for Choosing K')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
```
Distortion always decreases with more clusters (at $K = N$ it's zero). The "elbow" is where adding another cluster gives diminishing returns. For our data, the bend at $K = 3$ is clear.
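One way to make "diminishing returns" concrete is to compare the fractional drop in $J$ from each extra cluster. The distortion values below are hypothetical, shaped like a typical elbow curve, purely to illustrate the rule:

```python
# Hypothetical final distortions by K (illustrative elbow shape)
final_J = {1: 5000.0, 2: 2600.0, 3: 450.0, 4: 380.0, 5: 330.0, 6: 300.0}

# Fractional improvement gained by moving from K to K+1
gains = {k: 1 - final_J[k + 1] / final_J[k] for k in range(1, 6)}

# The elbow: first K whose next cluster adds little (the 20% threshold is a judgement call)
elbow = min(k for k, g in gains.items() if g < 0.2)
print(elbow)  # 3
```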
For a more principled approach, GMMs offer the Bayesian Information Criterion (BIC), which penalises model complexity — see the GMM tutorial for details.
Hyperparameters
| Parameter | What it controls | Guidance |
|---|---|---|
| `K` | Number of clusters | Use elbow method, silhouette score, or BIC |
| `max_iter` | Maximum iterations | 100-300 usually sufficient |
| `seed` / initialisation | Starting centroids | K-Means++ (default in scikit-learn) is more robust than random |
| `n_init` | Number of random restarts | 10 is typical; keeps the best result |
When NOT to Use K-Means
- Non-spherical clusters — K-Means assumes isotropic distance. Elongated or curved clusters get split incorrectly. Use GMMs or DBSCAN.
- Unequal cluster sizes — K-Means tends to produce roughly equal-sized clusters, even when the true clusters differ dramatically in size.
- You need uncertainty — If a point's cluster membership matters (e.g., for downstream decisions), use a GMM to get probabilities.
- Sequential data — If your observations have a natural ordering (time series), the i.i.d. assumption breaks. Use HMMs instead.
When K-Means wins: It's fast ($\mathcal{O}(NKD)$ per iteration), scales to millions of points, and when clusters are roughly spherical and well-separated, it gives the same answer as the GMM with far less computation.
Deep Dive: The Papers
Lloyd (1982) and the History of K-Means
The algorithm we call "K-Means" was first described by Stuart Lloyd in a 1957 Bell Labs technical report, but wasn't formally published until 1982 as "Least Squares Quantization in PCM" in IEEE Transactions on Information Theory. Lloyd was working on pulse-code modulation — converting analogue signals to digital — and needed to find optimal quantisation levels that minimised reconstruction error.
Meanwhile, James MacQueen independently described and named the algorithm "K-Means" in a 1967 conference paper, giving it the name we still use today.
The Distortion Objective
Lloyd's algorithm minimises:

$$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \lVert \mathbf{x}_n - \boldsymbol{\mu}_k \rVert^2$$

Each iteration alternates between:

- Minimise $J$ over $r_{nk}$ (with $\boldsymbol{\mu}_k$ fixed) — this is the assignment step
- Minimise $J$ over $\boldsymbol{\mu}_k$ (with $r_{nk}$ fixed) — this is the update step

Setting the derivative of $J$ with respect to $\boldsymbol{\mu}_k$ to zero gives:

$$\boldsymbol{\mu}_k = \frac{\sum_{n} r_{nk} \, \mathbf{x}_n}{\sum_{n} r_{nk}}$$

which is just the mean of the assigned points — exactly what our code computes.
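Spelling out that derivative (a standard least-squares step):

```latex
\frac{\partial J}{\partial \boldsymbol{\mu}_k}
  = -2 \sum_{n=1}^{N} r_{nk}\,(\mathbf{x}_n - \boldsymbol{\mu}_k) = 0
\quad\Longrightarrow\quad
\boldsymbol{\mu}_k = \frac{\sum_{n} r_{nk}\,\mathbf{x}_n}{\sum_{n} r_{nk}}
```

With hard responsibilities, the denominator $\sum_n r_{nk}$ is simply the number of points in cluster $k$.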
Bishop's Unified View (PRML Chapter 9)
Bishop's Pattern Recognition and Machine Learning (2006), Chapter 9, provides the clearest treatment of the K-Means–GMM connection. He shows that K-Means arises as a limiting case of the GMM when we:
- Fix all covariance matrices to $\epsilon \mathbf{I}$ (spherical, equal variance)
- Take the limit $\epsilon \to 0$
As $\epsilon$ shrinks, the Gaussian responsibilities $r_{nk} = \pi_k \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \epsilon \mathbf{I}) \,/\, \sum_j \pi_j \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \epsilon \mathbf{I})$ become increasingly peaked, converging to 0 or 1 — exactly the hard assignments of K-Means.
"We can also derive the K-means algorithm as a particular limit of the EM algorithm for Gaussian mixtures."
— Bishop, PRML (2006), §9.3.2
Our Implementation vs the Papers
| Paper concept | Our code |
|---|---|
| Distortion $J$ | `distortion = sum(np.sum((X[labels == k] - centroids[k])**2) ...)` |
| Hard responsibilities $r_{nk}$ | `labels = np.argmin(distances, axis=1)` |
| Centroid update $\boldsymbol{\mu}_k$ | `X[labels == k].mean(axis=0)` |
| Lloyd's assign-update loop | The `for iteration` loop |
Further Reading
- Lloyd (1982) — The original "Least Squares Quantization in PCM" paper
- MacQueen (1967) — Where the name "K-Means" was coined
- Bishop's PRML, Chapter 9 — The clearest K-Means ↔ GMM connection
- Arthur & Vassilvitskii (2007) — K-Means++ initialisation
Try It Yourself
The interactive notebook includes exercises:
- Initialisation sensitivity — Run K-Means with 10 different random seeds. How often does it find the "right" clusters? Try K-Means++ initialisation.
- Non-spherical clusters — Create elongated clusters. How does K-Means handle them compared to GMM?
- Old Faithful — Apply both K-Means and GMM to the Old Faithful geyser data (2D: duration + waiting time). Compare the cluster boundaries.
- Soft K-Means — Modify the code to use soft assignments: $r_{nk} = e^{-\beta d_{nk}^2} / \sum_j e^{-\beta d_{nj}^2}$. What happens as $\beta \to \infty$?
- Compare with scikit-learn — Verify your implementation matches `sklearn.cluster.KMeans` on the same data.
Interactive Tools
- Distribution Explorer — Visualise the Gaussian and other distributions that underpin mixture models
- Classification Metrics Calculator — Evaluate clustering quality with metrics like precision and recall
Related Posts
- The EM Algorithm: An Intuitive Guide — K-Means is EM with hard assignments; the coin-toss example shows why soft assignments work better
- Gaussian Mixture Models: EM in Practice — The soft-clustering generalisation of K-Means, applied to Old Faithful geyser data
- Maximum Likelihood Estimation from Scratch — The M-step in both K-Means and GMM is a form of (weighted) MLE
Frequently Asked Questions
What is the difference between hard and soft clustering?
Hard clustering assigns each data point to exactly one cluster (as in K-Means), while soft clustering assigns a probability of belonging to each cluster (as in GMMs). Soft clustering is more informative because it captures uncertainty: points near cluster boundaries get meaningful probabilities for multiple clusters rather than being forced into one.
When should I use GMM instead of K-Means?
Use GMM when your clusters have different shapes, sizes, or orientations, since K-Means assumes spherical clusters of similar size. GMM is also better when you need membership probabilities rather than hard assignments, or when clusters overlap significantly. K-Means is faster and simpler, so use it when clusters are roughly spherical and well-separated.
How do I choose the number of components in a GMM?
Use information criteria like BIC or AIC, which balance fit quality against model complexity. Fit GMMs with different numbers of components and pick the one with the lowest BIC. You can also use silhouette scores or domain knowledge. Unlike K-Means, GMM can also use the likelihood to compare models directly.
Can GMM handle clusters of different shapes?
Yes. Each component in a GMM has its own covariance matrix, which can model elliptical clusters at any orientation. By contrast, K-Means implicitly assumes spherical clusters. You can also constrain the covariance matrices (diagonal, spherical, tied) to reduce the number of parameters when data is limited.
Why does K-Means sometimes give poor results?
K-Means fails when clusters are non-spherical, have very different sizes, or overlap heavily. It is also sensitive to initialisation (solved partially by K-Means++) and to outliers, which can pull centroids away from the true cluster centres. If K-Means gives poor results, GMM or density-based methods like DBSCAN may be more appropriate.