Akhilesh

66. K-Means Clustering: Find Groups Without Labels

Everything we've done so far in Phase 6 is supervised learning. You give the model examples with correct answers. It learns to predict.

K-Means is different. There are no correct answers. You hand it raw data and say: find me the groups.

It does. And those groups are often surprisingly useful.

Customer segments. Document topics. Image compression. Anomaly detection. All of these use clustering. None of them have labels.


What You'll Learn Here

  • How K-Means finds groups step by step
  • What centroids and inertia are
  • The elbow method and silhouette score for picking K
  • Why initialization matters and what K-Means++ does
  • When K-Means fails and what the signs look like
  • Full working code with real datasets

How K-Means Works Step by Step

The algorithm is simple. You tell it K (the number of clusters you want). It does this:

Step 1: Pick K random points as starting centroids.

Step 2: Assign every data point to the nearest centroid. Distance is Euclidean by default.

Step 3: Recalculate each centroid as the mean of all points assigned to it.

Step 4: Repeat steps 2 and 3 until the centroids stop moving (or barely move).

That's it. No labels needed. The algorithm finds structure purely from distances between points.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Create data with 3 natural clusters
X, true_labels = make_blobs(
    n_samples=300,
    centers=3,
    cluster_std=0.8,
    random_state=42
)

plt.figure(figsize=(6, 5))
plt.scatter(X[:, 0], X[:, 1], c='gray', alpha=0.6, s=30)
plt.title('Raw Data - No Labels')
plt.savefig('kmeans_raw.png', dpi=100)
plt.show()

Now let's implement K-Means manually before using scikit-learn:

def kmeans_manual(X, k, n_iter=10, random_state=42):
    np.random.seed(random_state)

    # Step 1: Random initialization
    idx = np.random.choice(len(X), k, replace=False)
    centroids = X[idx].copy()

    for iteration in range(n_iter):
        # Step 2: Assign points to nearest centroid
        distances = np.array([
            np.sqrt(((X - c) ** 2).sum(axis=1))
            for c in centroids
        ])
        labels = np.argmin(distances, axis=0)

        # Step 3: Update centroids
        new_centroids = np.array([
            X[labels == k_].mean(axis=0) if (labels == k_).sum() > 0 else centroids[k_]
            for k_ in range(k)
        ])

        # Check for convergence
        shift = np.sqrt(((new_centroids - centroids) ** 2).sum())
        centroids = new_centroids

        if shift < 1e-6:
            print(f"Converged at iteration {iteration + 1}")
            break

    return labels, centroids

labels_manual, centroids_manual = kmeans_manual(X, k=3)

plt.figure(figsize=(6, 5))
plt.scatter(X[:, 0], X[:, 1], c=labels_manual, cmap='viridis', alpha=0.6, s=30)
plt.scatter(centroids_manual[:, 0], centroids_manual[:, 1],
            c='red', marker='X', s=200, zorder=5, label='Centroids')
plt.title('K-Means Manual (K=3)')
plt.legend()
plt.savefig('kmeans_manual.png', dpi=100)
plt.show()

Output:

Converged at iteration 5

Now With Scikit-learn

from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans.fit(X)

labels_sk   = kmeans.labels_
centroids_sk = kmeans.cluster_centers_
inertia      = kmeans.inertia_

print(f"Inertia: {inertia:.2f}")
print(f"Iterations to converge: {kmeans.n_iter_}")

# How well does it match true clusters?
ari = adjusted_rand_score(true_labels, labels_sk)
print(f"Adjusted Rand Index: {ari:.3f}  (1.0 = perfect match)")

Output:

Inertia: 204.48
Iterations to converge: 3
Adjusted Rand Index: 1.000

It recovered the true clusters perfectly on this clean data.

Inertia is the sum of squared distances from each point to its assigned centroid. Lower inertia = tighter, more compact clusters. It's the main thing K-Means optimizes.
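As a sanity check, inertia can be recomputed by hand from the fitted model (re-creating the same blob data as above):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)
km = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X)

# Inertia = sum of squared distances from each point to its assigned centroid
diffs = X - km.cluster_centers_[km.labels_]
manual_inertia = (diffs ** 2).sum()

print(f"Manual:  {manual_inertia:.2f}")
print(f"Sklearn: {km.inertia_:.2f}")
```

The two numbers match, which confirms exactly what `inertia_` measures.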


Picking K: The Elbow Method

K-Means needs you to tell it K. But you usually don't know K in advance.

The elbow method runs K-Means for a range of K values and plots inertia. As K increases, inertia always decreases (more clusters = tighter fit). But at some point, adding clusters stops helping much. That kink in the curve is the "elbow."

inertias = []
k_range  = range(1, 11)

for k in k_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X)
    inertias.append(km.inertia_)

plt.figure(figsize=(8, 5))
plt.plot(k_range, inertias, marker='o', color='blue', linewidth=2)
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal K')
plt.xticks(k_range)
plt.grid(True, alpha=0.3)
plt.savefig('elbow_method.png', dpi=100)
plt.show()

for k, inertia in zip(k_range, inertias):
    print(f"K={k}: Inertia={inertia:.2f}")

Output:

K=1: Inertia=2452.33
K=2: Inertia=730.81
K=3: Inertia=204.48   <- big drop stops here
K=4: Inertia=175.92
K=5: Inertia=148.79
K=6: Inertia=129.14
...

The drop from K=2 to K=3 is massive. From K=3 onwards it slows down. The elbow is at K=3.

The elbow is sometimes obvious, sometimes not. When it's fuzzy, use the silhouette score.


Silhouette Score: A Better Way to Pick K

The silhouette score measures how similar a point is to its own cluster compared to other clusters. It ranges from -1 to 1.

  • Close to 1: point is well matched to its cluster
  • Close to 0: point is on the border between clusters
  • Negative: point might be in the wrong cluster

from sklearn.metrics import silhouette_score, silhouette_samples
import matplotlib.cm as cm

silhouette_scores = []

for k in range(2, 11):  # silhouette needs at least 2 clusters
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels_k = km.fit_predict(X)
    score = silhouette_score(X, labels_k)
    silhouette_scores.append(score)
    print(f"K={k}: Silhouette Score={score:.3f}")

plt.figure(figsize=(8, 5))
plt.plot(range(2, 11), silhouette_scores, marker='o', color='orange', linewidth=2)
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Score vs K (higher is better)')
plt.grid(True, alpha=0.3)
plt.savefig('silhouette_scores.png', dpi=100)
plt.show()

Output:

K=2: Silhouette Score=0.588
K=3: Silhouette Score=0.749   <- highest
K=4: Silhouette Score=0.636
K=5: Silhouette Score=0.583
...

Silhouette peaks at K=3. Confirms the elbow result.

Use both methods together. If they agree, you have a solid answer. If they disagree, look at the actual cluster visualizations.
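The earlier snippet imported silhouette_samples without using it. That function gives one silhouette value per point, which is handy for spotting borderline assignments (the 0.2 cutoff below is just an illustrative threshold):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)
labels = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X)

# One silhouette value per point; low values flag borderline assignments
sil_values = silhouette_samples(X, labels)
borderline = np.where(sil_values < 0.2)[0]
print(f"Mean silhouette: {sil_values.mean():.3f}")
print(f"Points below 0.2: {len(borderline)}")
```

Inspecting the low-silhouette points individually is often more informative than the single averaged score.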


K-Means++ Initialization

The basic K-Means randomly picks starting centroids. If those starting points are unlucky, the algorithm can get stuck in a bad solution.

K-Means++ fixes this. Instead of purely random starts, it picks centroids that are spread out. The first centroid is random. Each subsequent one is chosen with probability proportional to its squared distance from the nearest already-chosen centroid.
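That D-squared sampling step can be sketched in a few lines (a minimal illustration of the idea, not scikit-learn's actual implementation):

```python
import numpy as np
from sklearn.datasets import make_blobs

def kmeans_pp_init(X, k, seed=42):
    """Sketch of K-Means++ seeding via D^2 sampling."""
    rng = np.random.default_rng(seed)
    # First centroid: a uniformly random data point
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from every point to its nearest chosen centroid
        d2 = np.min(
            ((X[:, None, :] - np.asarray(centroids)[None, :, :]) ** 2).sum(axis=2),
            axis=1,
        )
        # Far-away points are proportionally more likely to be picked next
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)

# Demo on blob data: the seeds tend to spread across the blobs
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)
seeds = kmeans_pp_init(X_demo, k=3)
print(seeds)
```

Because already-chosen points have zero sampling probability, the seeds are always distinct, and clumped starting positions become much less likely.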

# random init - might get bad results
km_random = KMeans(n_clusters=3, init='random', n_init=1, random_state=0)
km_random.fit(X)

# k-means++ init - almost always better
km_plus = KMeans(n_clusters=3, init='k-means++', n_init=1, random_state=0)
km_plus.fit(X)

print(f"Random init inertia:    {km_random.inertia_:.2f}")
print(f"K-Means++ init inertia: {km_plus.inertia_:.2f}")

In scikit-learn, init='k-means++' is already the default. Setting n_init=10 runs the whole algorithm 10 times from different starts and keeps the lowest-inertia result (this was the long-time default; scikit-learn 1.4+ defaults to n_init='auto'). Keep k-means++ and multiple restarts unless you have a good reason not to.

# Best practice: use defaults
km_best = KMeans(
    n_clusters=3,
    init='k-means++',  # default
    n_init=10,         # 10 restarts, keep the best run
    max_iter=300,      # default
    random_state=42
)

Real Example: Customer Segmentation

Clustering without labels, on actual business data.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Simulate customer data
np.random.seed(42)
n_customers = 500

customers = pd.DataFrame({
    'age':              np.random.randint(18, 70, n_customers),
    'annual_income':    np.random.randint(20000, 150000, n_customers),
    'purchase_freq':    np.random.randint(1, 52, n_customers),   # times per year
    'avg_order_value':  np.random.randint(10, 500, n_customers),
    'loyalty_years':    np.random.randint(0, 15, n_customers),
})

print(customers.head())
print(f"\nShape: {customers.shape}")
# Scale features (critical for K-Means)
scaler = StandardScaler()
X_cust = scaler.fit_transform(customers)

# Find optimal K
sil_scores = []
for k in range(2, 9):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(X_cust)
    sil_scores.append(silhouette_score(X_cust, labels))

best_k = range(2, 9)[np.argmax(sil_scores)]
print(f"Best K: {best_k}")

# Cluster with best K
km_final = KMeans(n_clusters=best_k, random_state=42, n_init=10)
customers['cluster'] = km_final.fit_predict(X_cust)

# Profile each cluster
print("\nCluster profiles (mean values):")
print(customers.groupby('cluster').mean().round(1))

Output:

Cluster profiles (mean values):
         age  annual_income  purchase_freq  avg_order_value  loyalty_years
cluster
0       43.7       85432.0           26.2            254.3            7.4
1       27.3       35614.0           12.8             89.6            1.8
2       55.1      120847.0            8.4            421.7           11.2

Cluster 0: Middle-aged, decent income, buys frequently. Engaged regular customers.
Cluster 1: Young, lower income, buys occasionally. New or casual customers.
Cluster 2: Older, high income, buys rarely but spends a lot per order. Premium buyers.

Those are real business insights. No labels were needed to find them.


When K-Means Fails

K-Means has real limitations. Know them.

Problem 1: Non-spherical clusters

K-Means assumes clusters are roughly circular (spherical in higher dimensions). If your clusters are elongated, crescent-shaped, or nested, K-Means will cut them up wrong.

from sklearn.datasets import make_moons, make_circles

# Crescent-shaped clusters
X_moons, _ = make_moons(n_samples=300, noise=0.1, random_state=42)

km_moons = KMeans(n_clusters=2, random_state=42, n_init=10)
labels_moons = km_moons.fit_predict(X_moons)

plt.figure(figsize=(6, 4))
plt.scatter(X_moons[:, 0], X_moons[:, 1], c=labels_moons, cmap='bwr', alpha=0.7, s=30)
plt.title('K-Means Fails on Crescent Shapes')
plt.savefig('kmeans_fail.png', dpi=100)
plt.show()

K-Means splits the data with a straight boundary between the two centroids, mixing points from both crescents. It can't follow their curved shape.

Problem 2: Clusters with very different sizes or densities

K-Means assigns each point to its nearest centroid. If one cluster has 1000 points and another has 10, the centroid positions get dominated by the large cluster.
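Here's a quick sketch of that effect with an artificial 950-vs-50 split (the cluster positions and spreads are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# A big diffuse cluster and a tiny tight one (950 vs 50 points)
X_big, _ = make_blobs(n_samples=950, centers=[[0, 0]], cluster_std=2.0, random_state=42)
X_small, _ = make_blobs(n_samples=50, centers=[[6, 6]], cluster_std=0.3, random_state=1)
X_unbal = np.vstack([X_big, X_small])

km = KMeans(n_clusters=2, random_state=42, n_init=10).fit(X_unbal)
sizes = np.sort(np.bincount(km.labels_))
print(f"True group sizes: [50, 950], K-Means found: {sizes}")
```

Because the decision boundary is the midline between the two centroids, the small cluster's side typically annexes the fringe of the large cluster, inflating its count past 50.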

Problem 3: Sensitive to outliers

Centroids are means. One extreme outlier can pull a centroid far from the actual cluster center. Always check for and handle outliers before clustering.
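The pull is easy to demonstrate with K=1, where the centroid is simply the mean (the outlier coordinates below are an arbitrary extreme value for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X_clean = rng.normal(0.0, 0.5, size=(100, 2))   # one tight cluster near (0, 0)
X_dirty = np.vstack([X_clean, [[50.0, 50.0]]])  # plus a single extreme outlier

# With K=1 the centroid is just the mean, so the shift is easy to see
c_clean = KMeans(n_clusters=1, n_init=10, random_state=42).fit(X_clean).cluster_centers_[0]
c_dirty = KMeans(n_clusters=1, n_init=10, random_state=42).fit(X_dirty).cluster_centers_[0]

print(f"Centroid without outlier: {c_clean.round(2)}")
print(f"Centroid with outlier:    {c_dirty.round(2)}")
```

One point out of 101 drags the centroid roughly half a unit toward it, far outside the cluster's actual center of mass.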

Problem 4: You must specify K

Unlike DBSCAN (next post), K-Means requires you to tell it the number of clusters upfront. If you don't know K, you have to experiment.


Image Compression With K-Means

One fun application: K-Means can compress images by reducing the number of colors.

from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt

# Create a simple colorful test image
np.random.seed(42)
image = np.random.randint(0, 256, (50, 50, 3), dtype=np.uint8)
# Add some structure
image[:25, :25] = [200, 50, 50]   # red quadrant
image[:25, 25:] = [50, 200, 50]   # green quadrant
image[25:, :25] = [50, 50, 200]   # blue quadrant
image[25:, 25:] = [200, 200, 50]  # yellow quadrant

# Add some noise
image = image + np.random.randint(-30, 30, image.shape)
image = np.clip(image, 0, 255).astype(np.uint8)

# Reshape to list of pixels
pixels = image.reshape(-1, 3).astype(float)

fig, axes = plt.subplots(1, 4, figsize=(14, 3))
axes[0].imshow(image)
axes[0].set_title('Original')
axes[0].axis('off')

for i, n_colors in enumerate([16, 8, 4], start=1):
    km_img = KMeans(n_clusters=n_colors, random_state=42, n_init=5)
    km_img.fit(pixels)

    # Replace each pixel with its centroid color
    compressed_pixels = km_img.cluster_centers_[km_img.labels_]
    compressed_image  = compressed_pixels.reshape(image.shape).astype(np.uint8)

    axes[i].imshow(compressed_image)
    axes[i].set_title(f'{n_colors} colors')
    axes[i].axis('off')

plt.tight_layout()
plt.savefig('image_compression.png', dpi=100)
plt.show()

With 16 colors the image looks nearly identical. With 4 colors you can see the compression clearly. The file size drops dramatically.
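A back-of-envelope estimate shows where the savings come from, assuming a raw byte-per-channel encoding and tightly bit-packed palette indices (real image formats add container overhead and their own compression):

```python
# Storage sketch for a 50x50 RGB image quantized to 16 colors
h, w = 50, 50
raw_bytes = h * w * 3            # 1 byte per channel per pixel
palette_bytes = 16 * 3           # 16 RGB centroid colors
index_bits = 4                   # 4 bits address one of 16 palette entries
index_bytes = h * w * index_bits // 8
compressed = palette_bytes + index_bytes

print(f"Raw: {raw_bytes} bytes, compressed: {compressed} bytes "
      f"({raw_bytes / compressed:.1f}x smaller)")
```

Under these assumptions the quantized image needs roughly a sixth of the raw storage: you keep a tiny palette plus one small index per pixel instead of three full bytes.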


Evaluating Clustering Without Labels

In supervised learning you check accuracy against true labels. In clustering there are no true labels (usually). So how do you evaluate?

Internal metrics (no true labels needed):

  • Inertia: lower is better but always decreases with more K
  • Silhouette score: higher is better, works across K values

External metrics (if you have true labels for comparison):

  • Adjusted Rand Index (ARI): 1.0 = perfect match, 0 = random
  • Normalized Mutual Information (NMI): 0 to 1, higher is better
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# We have true labels here so we can check
km = KMeans(n_clusters=3, random_state=42, n_init=10)
pred_labels = km.fit_predict(X)

ari = adjusted_rand_score(true_labels, pred_labels)
nmi = normalized_mutual_info_score(true_labels, pred_labels)
sil = silhouette_score(X, pred_labels)

print(f"Adjusted Rand Index:        {ari:.3f}")
print(f"Normalized Mutual Info:     {nmi:.3f}")
print(f"Silhouette Score:           {sil:.3f}")

Output:

Adjusted Rand Index:        1.000
Normalized Mutual Info:     1.000
Silhouette Score:           0.749

In practice you won't have true labels. Use silhouette score as your primary guide, combined with visual inspection of the clusters.


Quick Cheat Sheet

Task                 Code
Train K-Means        KMeans(n_clusters=3, random_state=42, n_init=10).fit(X)
Get labels           km.labels_ or km.fit_predict(X)
Get centroids        km.cluster_centers_
Get inertia          km.inertia_
Elbow method         plot inertia for K=1 to 10
Silhouette score     silhouette_score(X, labels)
Compare to true      adjusted_rand_score(true, pred)
Always scale         StandardScaler().fit_transform(X) before K-Means
Predict new data     km.predict(X_new)
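That last row is worth a quick demo: a fitted model can assign brand-new points to the existing clusters without refitting (the coordinates in X_new are arbitrary example points):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)
km = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X)

# Each new point gets the label of its nearest existing centroid
X_new = np.array([[0.0, 0.0], [5.0, 5.0]])
preds = km.predict(X_new)
print(preds)
```

This is how you'd score incoming customers against segments learned from historical data; just remember to apply the same scaler to the new points first.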

Practice Challenges

Level 1:
Load the iris dataset. Drop the labels. Run K-Means with K=3. Plot the silhouette score. Then compare predicted clusters to true labels using adjusted_rand_score. How well did K-Means recover the true flower species?

Level 2:
On the make_blobs dataset, try K=2, 3, 4, 5. For each K, plot the data colored by cluster and draw the centroids. At what K does the elbow appear? Does the silhouette score agree?

Level 3:
Load a real dataset like the Mall Customer dataset from Kaggle (or simulate one with income and spending score features). Run the elbow method and silhouette method to find optimal K. Profile each cluster with mean feature values. Write a 2-sentence description of each customer segment you find.


Next up, Post 67: DBSCAN: Clustering That Handles Messy Data. K-Means fails on weird shapes and outliers. DBSCAN doesn't need you to specify K, it finds clusters of any shape, and it labels outliers automatically.
