Everything we've done so far in Phase 6 is supervised learning. You give the model examples with correct answers. It learns to predict.
K-Means is different. There are no correct answers. You hand it raw data and say: find me the groups.
It does. And those groups are often surprisingly useful.
Customer segments. Document topics. Image compression. Anomaly detection. All of these use clustering. None of them have labels.
What You'll Learn Here
- How K-Means finds groups step by step
- What centroids and inertia are
- The elbow method and silhouette score for picking K
- Why initialization matters and what K-Means++ does
- When K-Means fails and what the signs look like
- Full working code with real datasets
How K-Means Works Step by Step
The algorithm is simple. You tell it K (the number of clusters you want). It does this:
Step 1: Pick K random points as starting centroids.
Step 2: Assign every data point to the nearest centroid. Distance is Euclidean by default.
Step 3: Recalculate each centroid as the mean of all points assigned to it.
Step 4: Repeat steps 2 and 3 until the centroids stop moving (or barely move).
That's it. No labels needed. The algorithm finds structure purely from distances between points.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
# Create data with 3 natural clusters
X, true_labels = make_blobs(
    n_samples=300,
    centers=3,
    cluster_std=0.8,
    random_state=42
)
plt.figure(figsize=(6, 5))
plt.scatter(X[:, 0], X[:, 1], c='gray', alpha=0.6, s=30)
plt.title('Raw Data - No Labels')
plt.savefig('kmeans_raw.png', dpi=100)
plt.show()
Now let's implement K-Means manually before using scikit-learn:
def kmeans_manual(X, k, n_iter=10, random_state=42):
    np.random.seed(random_state)
    # Step 1: Random initialization
    idx = np.random.choice(len(X), k, replace=False)
    centroids = X[idx].copy()
    for iteration in range(n_iter):
        # Step 2: Assign points to nearest centroid
        distances = np.array([
            np.sqrt(((X - c) ** 2).sum(axis=1))
            for c in centroids
        ])
        labels = np.argmin(distances, axis=0)
        # Step 3: Update centroids (keep the old one if a cluster goes empty)
        new_centroids = np.array([
            X[labels == k_].mean(axis=0) if (labels == k_).sum() > 0 else centroids[k_]
            for k_ in range(k)
        ])
        # Check for convergence
        shift = np.sqrt(((new_centroids - centroids) ** 2).sum())
        centroids = new_centroids
        if shift < 1e-6:
            print(f"Converged at iteration {iteration + 1}")
            break
    return labels, centroids
labels_manual, centroids_manual = kmeans_manual(X, k=3)
plt.figure(figsize=(6, 5))
plt.scatter(X[:, 0], X[:, 1], c=labels_manual, cmap='viridis', alpha=0.6, s=30)
plt.scatter(centroids_manual[:, 0], centroids_manual[:, 1],
            c='red', marker='X', s=200, zorder=5, label='Centroids')
plt.title('K-Means Manual (K=3)')
plt.legend()
plt.savefig('kmeans_manual.png', dpi=100)
plt.show()
Output:
Converged at iteration 5
Now With Scikit-learn
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans.fit(X)
labels_sk = kmeans.labels_
centroids_sk = kmeans.cluster_centers_
inertia = kmeans.inertia_
print(f"Inertia: {inertia:.2f}")
print(f"Iterations to converge: {kmeans.n_iter_}")
# How well does it match true clusters?
ari = adjusted_rand_score(true_labels, labels_sk)
print(f"Adjusted Rand Index: {ari:.3f} (1.0 = perfect match)")
Output:
Inertia: 204.48
Iterations to converge: 3
Adjusted Rand Index: 1.000
It recovered the true clusters perfectly on this clean data.
Inertia is the sum of squared distances from each point to its assigned centroid. Lower inertia = tighter, more compact clusters. It's the main thing K-Means optimizes.
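You can verify that definition directly from the fitted model above (this assumes X, labels_sk, and centroids_sk from the scikit-learn example are still in scope):

manual_inertia = sum(
    ((X[labels_sk == c] - centroids_sk[c]) ** 2).sum()
    for c in range(3)
)
print(f"Manual inertia: {manual_inertia:.2f}")  # should match kmeans.inertia_ (~204.48)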
Picking K: The Elbow Method
K-Means needs you to tell it K. But you usually don't know K in advance.
The elbow method runs K-Means for a range of K values and plots inertia. As K increases, inertia always decreases (more clusters = tighter fit). But at some point, adding clusters stops helping much. That kink in the curve is the "elbow."
inertias = []
k_range = range(1, 11)
for k in k_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X)
    inertias.append(km.inertia_)
plt.figure(figsize=(8, 5))
plt.plot(k_range, inertias, marker='o', color='blue', linewidth=2)
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal K')
plt.xticks(k_range)
plt.grid(True, alpha=0.3)
plt.savefig('elbow_method.png', dpi=100)
plt.show()
for k, inertia in zip(k_range, inertias):
    print(f"K={k}: Inertia={inertia:.2f}")
Output:
K=1: Inertia=2452.33
K=2: Inertia=730.81
K=3: Inertia=204.48 <- big drop stops here
K=4: Inertia=175.92
K=5: Inertia=148.79
K=6: Inertia=129.14
...
The drop from K=2 to K=3 is massive. From K=3 onwards it slows down. The elbow is at K=3.
The elbow is sometimes obvious, sometimes not. When it's fuzzy, use the silhouette score.
Silhouette Score: A Better Way to Pick K
The silhouette score measures how similar a point is to its own cluster compared to other clusters. For each point it compares a, the average distance to the other points in its own cluster, with b, the average distance to the points in the nearest other cluster: s = (b - a) / max(a, b). It ranges from -1 to 1. A quick hand computation of this formula follows the list below.
- Close to 1: point is well matched to its cluster
- Close to 0: point is on the border between clusters
- Negative: point might be in the wrong cluster
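To make the formula concrete, here is the score for a single point computed by hand and checked against scikit-learn's silhouette_samples (assuming X and labels_sk from the earlier fit are still in scope):

from scipy.spatial.distance import cdist
from sklearn.metrics import silhouette_samples

i = 0
own = labels_sk[i]
# a: mean distance to the other points in its own cluster
same = (labels_sk == own)
same[i] = False
a = cdist(X[[i]], X[same])[0].mean()
# b: mean distance to the points in the nearest other cluster
b = min(cdist(X[[i]], X[labels_sk == c])[0].mean()
        for c in np.unique(labels_sk) if c != own)
print((b - a) / max(a, b))
print(silhouette_samples(X, labels_sk)[0])  # should match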
from sklearn.metrics import silhouette_score, silhouette_samples
import matplotlib.cm as cm
silhouette_scores = []
for k in range(2, 11):  # silhouette needs at least 2 clusters
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels_k = km.fit_predict(X)
    score = silhouette_score(X, labels_k)
    silhouette_scores.append(score)
    print(f"K={k}: Silhouette Score={score:.3f}")
plt.figure(figsize=(8, 5))
plt.plot(range(2, 11), silhouette_scores, marker='o', color='orange', linewidth=2)
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Score vs K (higher is better)')
plt.grid(True, alpha=0.3)
plt.savefig('silhouette_scores.png', dpi=100)
plt.show()
Output:
K=2: Silhouette Score=0.588
K=3: Silhouette Score=0.749 <- highest
K=4: Silhouette Score=0.636
K=5: Silhouette Score=0.583
...
Silhouette peaks at K=3. Confirms the elbow result.
Use both methods together. If they agree, you have a solid answer. If they disagree, look at the actual cluster visualizations.
K-Means++ Initialization
The basic K-Means randomly picks starting centroids. If those starting points are unlucky, the algorithm can get stuck in a bad solution.
K-Means++ fixes this. Instead of pure random starts, it picks centroids that are spread out. The first centroid is a random data point. Each subsequent one is chosen with probability proportional to its squared distance from the nearest already-chosen centroid, so far-away points are more likely to be selected.
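Here is a minimal sketch of that selection rule, just to show the idea (a toy helper, not scikit-learn's actual implementation, which adds extra candidate sampling and optimizations):

def kmeans_pp_init(X, k, seed=42):
    rng = np.random.default_rng(seed)
    # first centroid: a uniformly random data point
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # squared distance from each point to its nearest chosen centroid
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        # pick the next centroid with probability proportional to that squared distance
        next_idx = rng.choice(len(X), p=d2 / d2.sum())
        centroids.append(X[next_idx])
    return np.array(centroids)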
# random init - might get bad results
km_random = KMeans(n_clusters=3, init='random', n_init=1, random_state=0)
km_random.fit(X)
# k-means++ init - almost always better
km_plus = KMeans(n_clusters=3, init='k-means++', n_init=1, random_state=0)
km_plus.fit(X)
print(f"Random init inertia: {km_random.inertia_:.2f}")
print(f"K-Means++ init inertia: {km_plus.inertia_:.2f}")
In scikit-learn, init='k-means++' is already the default. Setting n_init=10 runs the whole algorithm 10 times from different starting centroids and keeps the run with the lowest inertia (older scikit-learn versions used 10 as the default; newer ones default to 'auto'). Keep k-means++ and multiple restarts.
# Best practice: set these explicitly
km_best = KMeans(
    n_clusters=3,
    init='k-means++',  # default
    n_init=10,         # run 10 initializations, keep the best
    max_iter=300,      # default
    random_state=42
)
Real Example: Customer Segmentation
Clustering without labels, on actual business data.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
# Simulate customer data
np.random.seed(42)
n_customers = 500
customers = pd.DataFrame({
    'age': np.random.randint(18, 70, n_customers),
    'annual_income': np.random.randint(20000, 150000, n_customers),
    'purchase_freq': np.random.randint(1, 52, n_customers),  # times per year
    'avg_order_value': np.random.randint(10, 500, n_customers),
    'loyalty_years': np.random.randint(0, 15, n_customers),
})
print(customers.head())
print(f"\nShape: {customers.shape}")
# Scale features (critical for K-Means)
scaler = StandardScaler()
X_cust = scaler.fit_transform(customers)
# Find optimal K
sil_scores = []
for k in range(2, 9):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(X_cust)
    sil_scores.append(silhouette_score(X_cust, labels))
best_k = range(2, 9)[np.argmax(sil_scores)]
print(f"Best K: {best_k}")
# Cluster with best K
km_final = KMeans(n_clusters=best_k, random_state=42, n_init=10)
customers['cluster'] = km_final.fit_predict(X_cust)
# Profile each cluster
print("\nCluster profiles (mean values):")
print(customers.groupby('cluster').mean().round(1))
Output:
Cluster profiles (mean values):
age annual_income purchase_freq avg_order_value loyalty_years
cluster
0 43.7 85432.0 26.2 254.3 7.4
1 27.3 35614.0 12.8 89.6 1.8
2 55.1 120847.0 8.4 421.7 11.2
Cluster 0: Middle-aged, decent income, buys frequently. Engaged regular customers.
Cluster 1: Young, lower income, buys occasionally. New or casual customers.
Cluster 2: Older, high income, buys rarely but spends a lot per order. Premium buyers.
Those are real business insights. No labels were needed to find them.
When K-Means Fails
K-Means has real limitations. Know them.
Problem 1: Non-spherical clusters
K-Means assumes clusters are roughly circular (spherical in higher dimensions). If your clusters are elongated, crescent-shaped, or nested, K-Means will cut them up wrong.
from sklearn.datasets import make_moons, make_circles
# Crescent-shaped clusters
X_moons, _ = make_moons(n_samples=300, noise=0.1, random_state=42)
km_moons = KMeans(n_clusters=2, random_state=42)
labels_moons = km_moons.fit_predict(X_moons)
plt.figure(figsize=(6, 4))
plt.scatter(X_moons[:, 0], X_moons[:, 1], c=labels_moons, cmap='bwr', alpha=0.7, s=30)
plt.title('K-Means Fails on Crescent Shapes')
plt.savefig('kmeans_fail.png', dpi=100)
plt.show()
K-Means separates the points with a straight boundary between the two centroids, slicing through both crescents. It can't follow their curved shape.
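You can put a number on the failure by regenerating the true moon labels (discarded above) and comparing them to the K-Means assignment; the score typically lands well below the 1.000 we saw on the blobs:

from sklearn.metrics import adjusted_rand_score

# regenerate the true moon labels with the same random_state used above
_, y_moons = make_moons(n_samples=300, noise=0.1, random_state=42)
print(f"ARI on moons: {adjusted_rand_score(y_moons, labels_moons):.3f}")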
Problem 2: Clusters with very different sizes or densities
K-Means assigns each point to its nearest centroid. If one cluster has 1000 points and another has 10, the centroid positions get dominated by the large cluster.
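A quick sketch of that effect, using made-up counts and centers (not data from this post): three blobs of 1000, 100, and 10 points with some overlap. K-Means often prefers to carve up the big blob and lump the small ones together, because that gives lower total inertia; compare the size distributions it produces with the true ones.

# unequal cluster sizes: 1000, 100, and 10 points (illustrative values)
X_uneq, y_uneq = make_blobs(n_samples=[1000, 100, 10],
                            centers=[[0, 0], [4, 0], [8, 0]],
                            cluster_std=1.5, random_state=42)
labels_uneq = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X_uneq)
print("True sizes:   ", np.bincount(y_uneq))
print("K-Means sizes:", np.bincount(labels_uneq))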
Problem 3: Sensitive to outliers
Centroids are means. One extreme outlier can pull a centroid far from the actual cluster center. Always check for and handle outliers before clustering.
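A tiny numeric illustration (made-up points): a single extreme value is enough to drag a mean, and therefore a centroid, far from where the cluster actually sits.

cluster_pts = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [1.1, 1.0]])
print(cluster_pts.mean(axis=0))                   # ~[1.05, 1.00], near the true center
with_outlier = np.vstack([cluster_pts, [[50.0, 50.0]]])
print(with_outlier.mean(axis=0))                  # ~[10.84, 10.80], pulled far away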
Problem 4: You must specify K
Unlike DBSCAN (next post), K-Means requires you to tell it the number of clusters upfront. If you don't know K, you have to experiment.
Image Compression With K-Means
One fun application: K-Means can compress images by reducing the number of colors.
from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt
# Create a simple colorful test image
np.random.seed(42)
image = np.random.randint(0, 256, (50, 50, 3), dtype=np.uint8)
# Add some structure
image[:25, :25] = [200, 50, 50] # red quadrant
image[:25, 25:] = [50, 200, 50] # green quadrant
image[25:, :25] = [50, 50, 200] # blue quadrant
image[25:, 25:] = [200, 200, 50] # yellow quadrant
# Add some noise
image = image + np.random.randint(-30, 30, image.shape)
image = np.clip(image, 0, 255).astype(np.uint8)
# Reshape to list of pixels
pixels = image.reshape(-1, 3).astype(float)
fig, axes = plt.subplots(1, 4, figsize=(14, 3))
axes[0].imshow(image)
axes[0].set_title('Original (full color)')
axes[0].axis('off')
for i, n_colors in enumerate([16, 8, 4], start=1):
    km_img = KMeans(n_clusters=n_colors, random_state=42, n_init=5)
    km_img.fit(pixels)
    # Replace each pixel with its centroid color
    compressed_pixels = km_img.cluster_centers_[km_img.labels_]
    compressed_image = compressed_pixels.reshape(image.shape).astype(np.uint8)
    axes[i].imshow(compressed_image)
    axes[i].set_title(f'{n_colors} colors')
    axes[i].axis('off')
plt.tight_layout()
plt.savefig('image_compression.png', dpi=100)
plt.show()
With 16 colors the image looks nearly identical. With 4 colors you can see the compression clearly. The file size drops dramatically.
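How dramatic? A rough back-of-envelope estimate for this 50x50 image, assuming you store one palette index per pixel plus the palette itself (real image formats add headers and their own compression, so treat this as an approximation):

h, w = image.shape[:2]
raw_bytes = h * w * 3  # 3 bytes (RGB) per pixel
for n_colors in [16, 8, 4]:
    bits_per_pixel = int(np.ceil(np.log2(n_colors)))
    palette_bytes = n_colors * 3
    approx_bytes = h * w * bits_per_pixel // 8 + palette_bytes
    print(f"{n_colors:>2} colors: ~{approx_bytes} bytes vs {raw_bytes} bytes raw")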
Evaluating Clustering Without Labels
In supervised learning you check accuracy against true labels. In clustering there are no true labels (usually). So how do you evaluate?
Internal metrics (no true labels needed):
- Inertia: lower is better but always decreases with more K
- Silhouette score: higher is better, works across K values
External metrics (if you have true labels for comparison):
- Adjusted Rand Index (ARI): 1.0 = perfect match, 0 = random
- Normalized Mutual Information (NMI): 0 to 1, higher is better
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score
# We have true labels here so we can check
km = KMeans(n_clusters=3, random_state=42, n_init=10)
pred_labels = km.fit_predict(X)
ari = adjusted_rand_score(true_labels, pred_labels)
nmi = normalized_mutual_info_score(true_labels, pred_labels)
sil = silhouette_score(X, pred_labels)
print(f"Adjusted Rand Index: {ari:.3f}")
print(f"Normalized Mutual Info: {nmi:.3f}")
print(f"Silhouette Score: {sil:.3f}")
Output:
Adjusted Rand Index: 1.000
Normalized Mutual Info: 1.000
Silhouette Score: 0.749
In practice you won't have true labels. Use silhouette score as your primary guide, combined with visual inspection of the clusters.
Quick Cheat Sheet
| Task | Code |
|---|---|
| Train K-Means | KMeans(n_clusters=3, random_state=42, n_init=10).fit(X) |
| Get labels | km.labels_ or km.fit_predict(X) |
| Get centroids | km.cluster_centers_ |
| Get inertia | km.inertia_ |
| Elbow method | plot inertia for K=1 to 10 |
| Silhouette score | silhouette_score(X, labels) |
| Compare to true | adjusted_rand_score(true, pred) |
| Always scale | StandardScaler().fit_transform(X) before K-Means |
| Predict new data | km.predict(X_new) |
Practice Challenges
Level 1:
Load the iris dataset. Drop the labels. Run K-Means with K=3. Plot the silhouette score. Then compare predicted clusters to true labels using adjusted_rand_score. How well did K-Means recover the true flower species?
Level 2:
On the make_blobs dataset, try K=2, 3, 4, 5. For each K, plot the data colored by cluster and draw the centroids. At what K does the elbow appear? Does the silhouette score agree?
Level 3:
Load a real dataset like the Mall Customer dataset from Kaggle (or simulate one with income and spending score features). Run the elbow method and silhouette method to find optimal K. Profile each cluster with mean feature values. Write a 2-sentence description of each customer segment you find.
References
- Scikit-learn: KMeans
- Scikit-learn: Clustering user guide
- StatQuest: K-Means Clustering (YouTube)
- Silhouette analysis in scikit-learn
Next up, Post 67: DBSCAN: Clustering That Handles Messy Data. K-Means fails on weird shapes and outliers. DBSCAN doesn't need you to specify K; it finds clusters of any shape and labels outliers automatically.