Berkan Sesen

Posted on • Originally published at sesen.ai

From K-Means to GMM: Hard vs Soft Clustering

You have a pile of unlabelled data and you want to find groups in it. K-Means is the algorithm everyone reaches for first — it's fast, simple, and usually works. But it makes a bold assumption: every data point belongs to exactly one cluster. No uncertainty, no hedging.

What happens to a point sitting right between two clusters? K-Means forces a choice. Gaussian Mixture Models (GMMs) offer an alternative — soft assignments that express how uncertain we are. By the end of this post, you'll implement K-Means from scratch, see why it's secretly a special case of the EM algorithm, and understand exactly when soft clustering beats hard clustering.

Quick Win: K-Means from Scratch

Let's cluster some data. Click the badge to open the interactive notebook:

Open In Colab

Here's K-Means from scratch — Lloyd's algorithm, about 30 lines of logic:

import numpy as np

def kmeans(X, K, max_iter=100, seed=42):
    """
    K-Means clustering (Lloyd's algorithm).

    Args:
        X: (N, D) array of data points
        K: number of clusters
        max_iter: maximum iterations
        seed: random seed for reproducibility

    Returns:
        centroids: (K, D) final cluster centres
        labels: (N,) hard cluster assignments
        history: list of (centroids, labels) at each iteration
        distortions: list of objective function values
    """
    rng = np.random.default_rng(seed)
    N, D = X.shape

    # Step 1: Initialise centroids by picking K random data points
    indices = rng.choice(N, K, replace=False)
    centroids = X[indices].copy()

    history = []
    distortions = []

    for iteration in range(max_iter):
        # Step 2: ASSIGN — each point to nearest centroid
        distances = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        labels = np.argmin(distances, axis=1)

        # Compute distortion (objective function)
        distortion = sum(np.sum((X[labels == k] - centroids[k])**2) for k in range(K))
        distortions.append(distortion)
        history.append((centroids.copy(), labels.copy()))

        # Step 3: UPDATE — move centroids to cluster means
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])

        # Check convergence
        if np.allclose(centroids, new_centroids):
            break

        centroids = new_centroids

    return centroids, labels, history, distortions

Run it on three synthetic clusters:

import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
n_per_cluster = 100
cluster_1 = rng.multivariate_normal([2, 2], [[0.8, 0.4], [0.4, 0.3]], n_per_cluster)
cluster_2 = rng.multivariate_normal([7, 7], [[0.5, -0.3], [-0.3, 0.8]], n_per_cluster)
cluster_3 = rng.multivariate_normal([2, 8], [[0.3, 0.0], [0.0, 0.6]], n_per_cluster)
X = np.vstack([cluster_1, cluster_2, cluster_3])

centroids, labels, history, distortions = kmeans(X, K=3)

colours = ['#e74c3c', '#3498db', '#2ecc71']

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

ax = axes[0]
for k in range(3):
    mask = labels == k
    ax.scatter(X[mask, 0], X[mask, 1], c=colours[k], alpha=0.5, s=20, label=f'Cluster {k+1}')
ax.scatter(centroids[:, 0], centroids[:, 1], c='black', marker='X', s=200, zorder=5, label='Centroids')
ax.set_xlabel('$x_1$'); ax.set_ylabel('$x_2$')
ax.set_title('K-Means Result (K=3)')
ax.legend(); ax.grid(True, alpha=0.3)

ax = axes[1]
ax.plot(range(1, len(distortions) + 1), distortions, 'k-o', markersize=6)
ax.set_xlabel('Iteration'); ax.set_ylabel('Distortion $J$')
ax.set_title('K-Means Objective Decreases at Each Step')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

K-Means clustering result showing three well-separated clusters with centroids, alongside the distortion curve decreasing at each iteration

The result: K-Means finds three clean clusters in about 5 iterations. The distortion (sum of squared distances to centroids) drops sharply and converges. Each point is assigned to exactly one cluster — no ambiguity.

What Just Happened?

K-Means repeats two steps until nothing changes:

Step 1: Assign Points to Nearest Centroid

distances = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
labels = np.argmin(distances, axis=1)

For each data point, compute the Euclidean distance to every centroid and pick the closest one. This creates hard responsibilities — each point gets a label of 0, 1, or 2, with no in-between.

Step 2: Move Centroids to Cluster Means

new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])

Each centroid moves to the average position of all points assigned to it. Think of it as a "centre of gravity" — the centroid is pulled toward its members.

Why It Converges

Each iteration can only decrease (or maintain) the distortion. Assigning a point to a closer centroid reduces the sum of squared distances. Moving a centroid to the cluster mean minimises the within-cluster variance. Since distortion is bounded below by zero and decreases monotonically, the algorithm must converge.
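This monotonicity is easy to verify numerically. The sketch below (not from the original post) reruns a compact Lloyd loop — equivalent to the `kmeans()` above, with a guard for empty clusters — and checks that the recorded distortion never increases:

```python
import numpy as np

# Compact Lloyd loop; records distortion at each assignment step.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.5, size=(50, 2)) for m in ([0, 0], [4, 4], [0, 4])])

K = 3
centroids = X[rng.choice(len(X), K, replace=False)].copy()
distortions = []
for _ in range(100):
    d = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
    labels = d.argmin(axis=1)
    distortions.append(((X - centroids[labels]) ** 2).sum())
    # Keep a centroid in place if its cluster is empty (avoids NaN means)
    new = np.array([X[labels == k].mean(axis=0) if (labels == k).any()
                    else centroids[k] for k in range(K)])
    if np.allclose(centroids, new):
        break
    centroids = new

print((np.diff(distortions) <= 1e-6).all())  # distortion never increases
```

Both steps minimise the same objective, so any sequence of them is non-increasing — exactly the argument above.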

Let's watch this happen step by step:

K-Means algorithm converging over 4 iterations — centroids move and cluster assignments stabilise

The centroids start at random data points and quickly settle into the cluster centres. By iteration 3-4, the assignments are essentially fixed.

from matplotlib.animation import FuncAnimation, PillowWriter

fig, ax = plt.subplots(figsize=(8, 6))

def update(frame):
    ax.clear()
    c, l = history[frame]
    for k in range(3):
        mask = l == k
        ax.scatter(X[mask, 0], X[mask, 1], c=colours[k], alpha=0.4, s=15)
    ax.scatter(c[:, 0], c[:, 1], c='black', marker='X', s=200, zorder=5)
    ax.set_title(f'K-Means: Iteration {frame + 1}')
    ax.set_xlim(X[:, 0].min()-1, X[:, 0].max()+1)
    ax.set_ylim(X[:, 1].min()-1, X[:, 1].max()+1)
    ax.grid(True, alpha=0.3)

anim = FuncAnimation(fig, update, frames=len(history), interval=600)
anim.save('kmeans.gif', writer=PillowWriter(fps=2))  # or plt.show() in a live session

The Boundary Problem

K-Means makes a hard decision for every point. But what about a point sitting exactly between two centroids? It gets forced into one cluster with 100% confidence, even though a 50/50 split would be more honest.

# Distance ratios reveal ambiguous points
distances = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
sorted_dists = np.sort(distances, axis=1)
ambiguity = 1 - (sorted_dists[:, 0] / sorted_dists[:, 1])

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

ax = axes[0]
for k in range(3):
    mask = labels == k
    ax.scatter(X[mask, 0], X[mask, 1], c=colours[k], alpha=0.5, s=20)
ax.scatter(centroids[:, 0], centroids[:, 1], c='black', marker='X', s=200, zorder=5)
ax.set_title('K-Means: Hard Assignments')
ax.set_xlabel('$x_1$'); ax.set_ylabel('$x_2$'); ax.grid(True, alpha=0.3)

ax = axes[1]
sc = ax.scatter(X[:, 0], X[:, 1], c=ambiguity, cmap='RdYlGn_r', s=20, alpha=0.7)
ax.scatter(centroids[:, 0], centroids[:, 1], c='black', marker='X', s=200, zorder=5)
plt.colorbar(sc, ax=ax, label='Ambiguity')
ax.set_title('How Uncertain Is Each Assignment?')
ax.set_xlabel('$x_1$'); ax.set_ylabel('$x_2$'); ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

K-Means hard assignments alongside an ambiguity heatmap showing uncertain points near cluster boundaries

The red points near the cluster boundaries are assigned with high confidence by K-Means, but they're actually ambiguous. This is the fundamental limitation of hard clustering.

Going Deeper

The Distortion Objective

K-Means minimises the distortion (also called inertia or within-cluster sum of squares):

$$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \lVert \mathbf{x}_n - \boldsymbol{\mu}_k \rVert^2$$

Where $r_{nk} \in \{0, 1\}$ is the hard responsibility — 1 if point $n$ is assigned to cluster $k$, 0 otherwise. The centroid $\boldsymbol{\mu}_k$ is the mean of all points assigned to cluster $k$.
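To make the $r_{nk}$ notation concrete, here's a small check (illustrative, random data) that the one-hot formulation of $J$ equals the per-cluster loop used in the code above:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2))
centroids = rng.normal(size=(3, 2))

d = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
labels = d.argmin(axis=1)

# One-hot responsibilities r_nk: 1 for the assigned cluster, 0 elsewhere
r = np.zeros((len(X), 3))
r[np.arange(len(X)), labels] = 1.0

# J = sum_n sum_k r_nk ||x_n - mu_k||^2
J_rnk = (r * d**2).sum()
J_loop = sum(((X[labels == k] - centroids[k]) ** 2).sum() for k in range(3))
print(np.isclose(J_rnk, J_loop))  # True
```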

K-Means Is EM with Hard Assignments

Here's the key insight: K-Means is a special case of the EM algorithm where the responsibilities are binary (0 or 1) instead of continuous probabilities.

To see this, write K-Means in EM notation:

def kmeans_as_em(X, K, max_iter=100, seed=42):
    """K-Means expressed as EM with hard (0/1) responsibilities."""
    rng = np.random.default_rng(seed)
    N, D = X.shape

    indices = rng.choice(N, K, replace=False)
    centroids = X[indices].copy()

    for iteration in range(max_iter):
        # E-STEP: Compute hard responsibilities r(n,k) ∈ {0, 1}
        distances = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        r = np.zeros((N, K))
        r[np.arange(N), np.argmin(distances, axis=1)] = 1.0

        # M-STEP: Update centroids using responsibilities
        Nk = r.sum(axis=0)
        new_centroids = (r.T @ X) / Nk[:, None]

        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids

    return centroids, r

Compare this with the GMM E-step where $r_{nk}$ is a probability between 0 and 1. The only difference is whether the responsibility is hard or soft.
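For contrast, here's a minimal sketch of the GMM E-step in one dimension, with fixed, equal variances for brevity (the full algorithm also re-estimates $\pi_k$, $\boldsymbol{\mu}_k$, and $\boldsymbol{\Sigma}_k$ in the M-step):

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

x = np.array([0.0, 1.0, 2.0])   # data points
mus = np.array([0.0, 2.0])      # component means
pis = np.array([0.5, 0.5])      # mixing weights
var = 1.0

# E-step: r_nk ∝ pi_k * N(x_n | mu_k, var), normalised over k
weighted = pis * gaussian_pdf(x[:, None], mus[None, :], var)
r = weighted / weighted.sum(axis=1, keepdims=True)
print(r.round(3))
```

Each row sums to 1, and the midpoint `x = 1.0` gets an honest 50/50 split — the in-between answer K-Means cannot express.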

As Bishop (2006) puts it: K-Means corresponds to the zero-temperature limit of the GMM. If you shrink all covariance matrices toward zero, the Gaussian components become infinitely peaked, and the soft responsibilities snap to 0 or 1.

K-Means vs GMM: Side by Side

Let's fit both to the same data and compare:

from matplotlib.patches import Ellipse
from sklearn.mixture import GaussianMixture

# Fit a GMM (scikit-learn) to obtain the soft responsibilities and
# per-cluster covariances used below
gmm = GaussianMixture(n_components=3, random_state=42).fit(X)
gmm_resp = gmm.predict_proba(X)
gmm_means = gmm.means_
gmm_covs = gmm.covariances_

def draw_ellipse(ax, mean, cov, colour, n_std=2):
    vals, vecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    angle = np.degrees(np.arctan2(vecs[1, 1], vecs[0, 1]))  # largest eigenvector
    width, height = 2 * n_std * np.sqrt(vals[::-1])  # width along that axis
    ell = Ellipse(xy=mean, width=width, height=height, angle=angle,
                  edgecolor=colour, facecolor='none', linewidth=2, linestyle='--')
    ax.add_patch(ell)

fig, axes = plt.subplots(1, 2, figsize=(13, 5.5))

# K-Means
ax = axes[0]
for k in range(3):
    mask = labels == k
    ax.scatter(X[mask, 0], X[mask, 1], c=colours[k], alpha=0.5, s=20)
ax.scatter(centroids[:, 0], centroids[:, 1], c='black', marker='X', s=200, zorder=5)
ax.set_title('K-Means: Hard Assignments\n(each point → exactly one cluster)')
ax.set_xlabel('$x_1$'); ax.set_ylabel('$x_2$'); ax.grid(True, alpha=0.3)

# GMM
ax = axes[1]
gmm_labels = np.argmax(gmm_resp, axis=1)
for k in range(3):
    mask = gmm_labels == k
    alphas = gmm_resp[mask, k]
    ax.scatter(X[mask, 0], X[mask, 1], c=colours[k], alpha=0.3 + 0.5*alphas, s=20)
    draw_ellipse(ax, gmm_means[k], gmm_covs[k], colours[k])
ax.scatter(gmm_means[:, 0], gmm_means[:, 1], c='black', marker='X', s=200, zorder=5)
ax.set_title('GMM: Soft Assignments\n(each point → probability per cluster)')
ax.set_xlabel('$x_1$'); ax.set_ylabel('$x_2$'); ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Side-by-side comparison of K-Means hard clustering and GMM soft clustering with covariance ellipses on the same data

Notice two key differences:

  1. Cluster shapes — K-Means produces spherical (Voronoi) boundaries. The GMM captures the elongated shape of each cluster with covariance ellipses.
  2. Transparency — In the GMM plot, points near boundaries are more transparent, reflecting their split responsibilities.

The Assumptions Progression

K-Means, GMMs, and Hidden Markov Models form a natural progression. Each relaxes one assumption from its predecessor:

| | K-Means | GMM | HMM |
|---|---|---|---|
| Responsibilities | Hard (0 or 1) | Soft (probabilities) | Soft (probabilities) |
| Data assumption | i.i.d. | i.i.d. | Sequentially dependent |
| Cluster shape | Spherical (isotropic) | Elliptical (full covariance) | Elliptical (full covariance) |
| Objective | Distortion $J$ | Log-likelihood $\ell$ | Log-likelihood $\ell$ |
| Fitting | Lloyd's algorithm | EM | Baum-Welch (EM for sequences) |

K-Means → GMM: relax hard to soft assignments. GMM → HMM: add sequential dependency between observations.

Choosing K: The Elbow Method

K-Means requires you to pick $K$ upfront. The elbow method plots the final distortion against $K$ and looks for the "bend":

K_range = range(1, 8)
final_distortions = []

for K in K_range:
    _, _, _, dists = kmeans(X, K=K)
    final_distortions.append(dists[-1])

plt.figure(figsize=(7, 4))
plt.plot(list(K_range), final_distortions, 'ko-', markersize=8)
plt.axvline(x=3, color='red', linestyle='--', alpha=0.5, label='True K=3')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Final Distortion $J$')
plt.title('Elbow Method for Choosing K')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Elbow plot showing distortion decreasing with K, with a clear bend at K=3

Distortion always decreases with more clusters (at $K = N$ it's zero). The "elbow" is where adding another cluster gives diminishing returns. For our data, the bend at $K = 3$ is clear.

For a more principled approach, GMMs offer the Bayesian Information Criterion (BIC), which penalises model complexity — see the GMM tutorial for details.
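Here's what that looks like in practice — a sketch using scikit-learn's `GaussianMixture.bic` on synthetic blobs similar to the ones above (the data and parameters here are illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(m, 0.6, size=(100, 2))
               for m in ([2, 2], [7, 7], [2, 8])])

# Fit GMMs for a range of component counts and record BIC (lower is better)
bics = []
for k in range(1, 8):
    gmm = GaussianMixture(n_components=k, random_state=42).fit(X)
    bics.append(gmm.bic(X))

best_k = int(np.argmin(bics)) + 1
print(best_k)
```

Unlike the elbow plot, BIC gives a single number to minimise, because the likelihood penalty for extra parameters is built in.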

Hyperparameters

| Parameter | What it controls | Guidance |
|---|---|---|
| `K` | Number of clusters | Use elbow method, silhouette score, or BIC |
| `max_iter` | Maximum iterations | 100-300 usually sufficient |
| `seed` / initialisation | Starting centroids | K-Means++ (default in scikit-learn) is more robust than random |
| `n_init` | Number of random restarts | 10 is typical; keeps the best result |

When NOT to Use K-Means

  1. Non-spherical clusters — K-Means assumes isotropic distance. Elongated or curved clusters get split incorrectly. Use GMMs or DBSCAN.
  2. Unequal cluster sizes — K-Means tends to produce roughly equal-sized clusters, even when the true clusters differ dramatically in size.
  3. You need uncertainty — If a point's cluster membership matters (e.g., for downstream decisions), use a GMM to get probabilities.
  4. Sequential data — If your observations have a natural ordering (time series), the i.i.d. assumption breaks. Use HMMs instead.

When K-Means wins: It's fast ($\mathcal{O}(NKD)$ per iteration), scales to millions of points, and when clusters are roughly spherical and well-separated, it gives the same answer as the GMM with far less computation.
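That last claim is easy to check. The sketch below (illustrative data, assuming scikit-learn) fits both models to spherical, well-separated blobs and compares the partitions with the adjusted Rand index, which is invariant to label permutations:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(m, 0.6, size=(100, 2))
               for m in ([2, 2], [7, 7], [2, 8])])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
gm = GaussianMixture(n_components=3, random_state=0).fit(X).predict(X)

# Near 1.0: the two partitions are essentially identical
print(adjusted_rand_score(km, gm))
```

On data like this, the extra machinery of the GMM buys nothing — which is exactly when K-Means is the right tool.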

Deep Dive: The Papers

Lloyd (1982) and the History of K-Means

The algorithm we call "K-Means" was first described by Stuart Lloyd in a 1957 Bell Labs technical report, but wasn't formally published until 1982 as "Least Squares Quantization in PCM" in IEEE Transactions on Information Theory. Lloyd was working on pulse-code modulation — converting analogue signals to digital — and needed to find optimal quantisation levels that minimised reconstruction error.

Meanwhile, James MacQueen independently described and named the algorithm "K-Means" in a 1967 conference paper, giving it the name we still use today.

The Distortion Objective

Lloyd's algorithm minimises:

$$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \lVert \mathbf{x}_n - \boldsymbol{\mu}_k \rVert^2$$

Each iteration alternates between:

  • Minimise $J$ over $r_{nk}$ (with $\boldsymbol{\mu}_k$ fixed) — this is the assignment step
  • Minimise $J$ over $\boldsymbol{\mu}_k$ (with $r_{nk}$ fixed) — this is the update step

Setting the derivative of $J$ with respect to $\boldsymbol{\mu}_k$ to zero gives:

$$\boldsymbol{\mu}_k = \frac{\sum_{n} r_{nk} \, \mathbf{x}_n}{\sum_{n} r_{nk}}$$

which is just the mean of the assigned points — exactly what our code computes.
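A quick check (illustrative, random labels): the closed-form update written with responsibilities, `(r.T @ X) / Nk`, reduces to the per-cluster mean when `r` is one-hot:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 2))
labels = rng.integers(0, 3, size=50)

# One-hot responsibility matrix
r = np.zeros((50, 3))
r[np.arange(50), labels] = 1.0

Nk = r.sum(axis=0)
mu_formula = (r.T @ X) / Nk[:, None]                              # closed form
mu_means = np.array([X[labels == k].mean(axis=0) for k in range(3)])  # loop
print(np.allclose(mu_formula, mu_means))  # True
```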

Bishop's Unified View (PRML Chapter 9)

Bishop's Pattern Recognition and Machine Learning (2006), Chapter 9, provides the clearest treatment of the K-Means–GMM connection. He shows that K-Means arises as a limiting case of the GMM when we:

  1. Fix all covariance matrices to $\epsilon \mathbf{I}$ (spherical, equal variance)
  2. Take the limit $\epsilon \to 0$

As $\epsilon$ shrinks, the responsibilities $r_{nk} = \pi_k \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \epsilon \mathbf{I}) \,/\, \sum_j \pi_j \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \epsilon \mathbf{I})$ become increasingly peaked, converging to 0 or 1 — exactly the hard assignments of K-Means.
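You can watch this limit numerically. The sketch below (two fixed means, equal weights — assumptions for illustration) evaluates the responsibilities of a point at $x = 0.9$, slightly nearer to $\mu = 1$, as $\epsilon$ shrinks:

```python
import numpy as np

x = np.array([0.9])          # one point, closer to mu = 1 than mu = 0
mus = np.array([0.0, 1.0])

for eps in [1.0, 0.1, 0.01]:
    # log of pi_k * N(x | mu_k, eps) up to constants; softmax normalises
    logits = -0.5 * (x[:, None] - mus[None, :]) ** 2 / eps
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    r = w / w.sum(axis=1, keepdims=True)
    print(eps, r.round(4))
```

At `eps = 1.0` the split is roughly 40/60; by `eps = 0.01` it has snapped to essentially 0/1 — the K-Means hard assignment.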

"We can also derive the K-means algorithm as a particular limit of the EM algorithm for Gaussian mixtures."
— Bishop, PRML (2006), §9.3.2

Our Implementation vs the Papers

| Paper concept | Our code |
|---|---|
| Distortion $J$ | `distortion = sum(np.sum((X[labels == k] - centroids[k])**2) ...)` |
| Hard responsibilities $r_{nk}$ | `labels = np.argmin(distances, axis=1)` |
| Centroid update $\boldsymbol{\mu}_k$ | `X[labels == k].mean(axis=0)` |
| Lloyd's assign-update loop | The `for iteration` loop |


Try It Yourself

The interactive notebook includes exercises:

  1. Initialisation sensitivity — Run K-Means with 10 different random seeds. How often does it find the "right" clusters? Try K-Means++ initialisation.
  2. Non-spherical clusters — Create elongated clusters. How does K-Means handle them compared to GMM?
  3. Old Faithful — Apply both K-Means and GMM to the Old Faithful geyser data (2D: duration + waiting time). Compare the cluster boundaries.
  4. Soft K-Means — Modify the code to use soft assignments: $r_{nk} = e^{-\beta d_{nk}^2} / \sum_j e^{-\beta d_{nj}^2}$. What happens as $\beta \to \infty$?
  5. Compare with scikit-learn — Verify your implementation matches sklearn.cluster.KMeans on the same data.


Frequently Asked Questions

What is the difference between hard and soft clustering?

Hard clustering assigns each data point to exactly one cluster (as in K-Means), while soft clustering assigns a probability of belonging to each cluster (as in GMMs). Soft clustering is more informative because it captures uncertainty: points near cluster boundaries get meaningful probabilities for multiple clusters rather than being forced into one.
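In scikit-learn terms, this is the difference between `predict` and `predict_proba` on the same fitted model (a sketch with illustrative 1-D data):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 1)),
               rng.normal(6.0, 1.0, size=(100, 1))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

point = np.array([[3.0]])          # midway between the two modes
hard = gmm.predict(point)          # hard: a single label
probs = gmm.predict_proba(point)   # soft: roughly a 50/50 split
print(hard, probs.round(3))
```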

When should I use GMM instead of K-Means?

Use GMM when your clusters have different shapes, sizes, or orientations, since K-Means assumes spherical clusters of similar size. GMM is also better when you need membership probabilities rather than hard assignments, or when clusters overlap significantly. K-Means is faster and simpler, so use it when clusters are roughly spherical and well-separated.

How do I choose the number of components in a GMM?

Use information criteria like BIC or AIC, which balance fit quality against model complexity. Fit GMMs with different numbers of components and pick the one with the lowest BIC. You can also use silhouette scores or domain knowledge. Unlike K-Means, GMM can also use the likelihood to compare models directly.

Can GMM handle clusters of different shapes?

Yes. Each component in a GMM has its own covariance matrix, which can model elliptical clusters at any orientation. By contrast, K-Means implicitly assumes spherical clusters. You can also constrain the covariance matrices (diagonal, spherical, tied) to reduce the number of parameters when data is limited.

Why does K-Means sometimes give poor results?

K-Means fails when clusters are non-spherical, have very different sizes, or overlap heavily. It is also sensitive to initialisation (solved partially by K-Means++) and to outliers, which can pull centroids away from the true cluster centres. If K-Means gives poor results, GMM or density-based methods like DBSCAN may be more appropriate.
