Introduction
Clustering is one of the most powerful techniques in unsupervised machine learning. Among all clustering algorithms, K-Means clustering remains one of the simplest, fastest, and most widely used methods for discovering patterns in unlabelled data.
Unlike supervised learning, clustering does not rely on predefined labels. Instead, it attempts to identify natural groupings within the dataset. While K-Means is often introduced as a beginner-friendly algorithm, applying it effectively requires a deep understanding of its assumptions and limitations.
In this article, we will explore:
The origins of K-Means clustering
Its core assumptions
How the algorithm works
Implementation in R
Real-life applications and industry case studies
How to choose the optimal number of clusters
The Origins of K-Means Clustering
K-Means clustering traces its origins back to the mid-20th century. The algorithm was first proposed by Stuart Lloyd in 1957 while working at Bell Labs. His work was later published in 1982. Independently, similar approaches were introduced by Edward Forgy in 1965, which is why the initialization step is sometimes called the "Forgy method."
Over time, K-Means became popular due to:
Computational efficiency
Mathematical simplicity
Scalability to large datasets
Interpretability of results
Today, K-Means is implemented in virtually every statistical and machine learning software package, including R, Python, SAS, and MATLAB.
Core Assumptions of K-Means Clustering
K-Means works well when certain assumptions about the data are satisfied.
1. Clusters Are Spherical
K-Means uses distance (usually Euclidean distance) to assign data points to the nearest cluster center. This implicitly assumes that clusters are spherical or roughly circular in shape.
If clusters are elongated, crescent-shaped, or concentric, K-Means may fail to identify them correctly.
2. Clusters Are of Similar Size
K-Means minimizes within-cluster variance. When one cluster is significantly smaller than another, the algorithm may incorrectly assign some points from the larger cluster into the smaller one to optimize variance.
3. Features Are Numeric and Scaled
Since the algorithm relies on distance, variables must be numeric and ideally standardized to prevent scale dominance.
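As a quick illustration, R's built-in scale() standardizes each column to mean 0 and standard deviation 1 (the iris measurements are used here purely as convenient example data):

```r
# Standardize numeric features so that no single variable dominates the
# distance metric. Any numeric data frame works; iris is just an example.
scaled_iris <- scale(iris[, 1:4])
round(colMeans(scaled_iris), 10)  # all ~0 after centering
apply(scaled_iris, 2, sd)         # all 1 after scaling
```

Without this step, a feature measured in thousands (e.g., income) would overwhelm a feature measured in single digits (e.g., age).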
How K-Means Works: Step-by-Step
The algorithm follows an iterative process:
Choose the number of clusters, K.
Initialize K cluster centers (randomly or strategically).
Assign each data point to the nearest cluster center.
Recalculate cluster centers as the mean of assigned points.
Repeat steps 3–4 until assignments no longer change.
The objective function minimized by K-Means is the Within-Cluster Sum of Squares (WCSS):
WCSS = Σₖ Σ_{x ∈ Cₖ} ||x − μₖ||²
where μₖ is the center (mean) of cluster Cₖ. This measures the compactness of the clusters.
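The iterative steps above can be sketched directly in R. This is an illustrative toy implementation (simple_kmeans is our own name, not a library function, and it does not handle empty clusters); in practice you would use the built-in kmeans():

```r
# Toy implementation of Lloyd's algorithm, for illustration only.
# simple_kmeans is a made-up name; use stats::kmeans() for real work.
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  centers <- x[sample(nrow(x), k), , drop = FALSE]  # Step 2: Forgy-style init
  assignments <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # Step 3: assign each point to the nearest center (Euclidean distance)
    d <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k]
    new_assignments <- max.col(-d, ties.method = "first")
    if (all(new_assignments == assignments)) break  # Step 5: converged
    assignments <- new_assignments
    # Step 4: move each center to the mean of its assigned points
    for (j in 1:k) {
      centers[j, ] <- colMeans(x[assignments == j, , drop = FALSE])
    }
  }
  list(cluster = assignments, centers = centers)
}
```

Calling set.seed() first makes the random initialization reproducible.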
Example 1: Clustering the Faithful Dataset in R
R provides a built-in dataset called faithful, which contains eruption duration and waiting times for the Old Faithful geyser.
When we visualize the dataset:
plot(faithful)
We can clearly see two clusters.
Applying K-Means:
k_clust_start <- kmeans(faithful, centers = 2)
plot(faithful, col = k_clust_start$cluster, pch = 2)
The algorithm successfully identifies:
Short eruptions with shorter waiting times
Long eruptions with longer waiting times
Cluster centers and sizes can be extracted using:
k_clust_start$centers
k_clust_start$size
This example works well because the assumptions of spherical and similar-sized clusters are satisfied.
When Assumptions Break: Concentric Circles Problem
Consider a dataset shaped like two concentric circles.
Here:
The inner cluster is circular.
The outer cluster surrounds it.
Even though two clusters clearly exist, K-Means fails because the outer cluster is not spherical in Euclidean space.
Solution: Data Transformation
By transforming Cartesian coordinates into polar coordinates:
cart2pol <- function(x, y) {
  radius <- sqrt(x^2 + y^2)
  angle  <- atan2(y, x)  # atan2() handles all four quadrants, unlike atan(y/x)
  cbind(radius, angle)
}
After transformation, clusters become separable in radial space.
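A simulated sketch of this idea (the ring radii, sample sizes, and noise level below are arbitrary illustration values). Rather than clustering on the raw (x, y) coordinates, we cluster on the radial coordinate alone:

```r
# Simulate two concentric rings with radii 1 and 5
set.seed(42)
theta <- runif(400, 0, 2 * pi)
r <- c(rep(1, 200), rep(5, 200)) + rnorm(400, sd = 0.1)
circles <- data.frame(x = r * cos(theta), y = r * sin(theta))

# K-Means on raw (x, y) fails here; on the radius it succeeds
radius <- sqrt(circles$x^2 + circles$y^2)
fit <- kmeans(radius, centers = 2, nstart = 20)
table(fit$cluster, rep(c("inner", "outer"), each = 200))
```

In radial space the two rings collapse into two compact, well-separated groups, exactly the geometry K-Means expects.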
This highlights a critical lesson:
Data preprocessing can determine clustering success.
Uneven Cluster Sizes Problem
Now imagine:
1000 data points around (0,0)
10 tightly grouped points around (5,5)
Although two clusters exist, K-Means may incorrectly split the large cluster to minimize total variance.
This demonstrates that:
K-Means prefers balanced partitions rather than true structural separation.
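A quick simulation illustrates this (the counts, centers, and spreads below are arbitrary illustration values):

```r
set.seed(7)
big   <- matrix(rnorm(2000, mean = 0, sd = 1.5), ncol = 2)  # 1000 points near (0, 0)
small <- matrix(rnorm(20,   mean = 5, sd = 0.1), ncol = 2)  # 10 points near (5, 5)
pts   <- rbind(big, small)

fit <- kmeans(pts, centers = 2, nstart = 25)
fit$size  # often roughly balanced, not the true 1000/10 split
```

Splitting the large cloud in half lowers the total within-cluster variance more than isolating the 10 tight points does, so K-Means tends to return two halves of the big cluster instead of the true partition.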
Choosing the Right K: The Elbow Method
Determining K is one of the most challenging aspects of clustering.
The Elbow Method plots:
Number of clusters (K)
Total Within-Cluster Sum of Squares (WCSS)
Example using iris:
set.seed(123)  # make the random initializations reproducible
sse <- numeric(14)
for (i in 2:15) {
  sse[i - 1] <- kmeans(iris[, 3:4], centers = i, nstart = 25)$tot.withinss
}
plot(2:15, sse, type = "b", xlab = "Number of clusters (K)", ylab = "Total WCSS")
Plotting K vs SSE often reveals an “elbow” point — where marginal improvement drops sharply.
For the iris dataset, K = 3 aligns with the known species categories.
Real-Life Applications of K-Means Clustering
K-Means is widely used across industries.
1. Customer Segmentation (Retail & E-commerce)
Use Case: Segment customers based on purchasing behavior.
Features may include:
Purchase frequency
Average transaction value
Recency
Product categories
Case Study:
An online retailer used K-Means to segment 500,000 customers into:
High-value loyal customers
Discount-driven buyers
One-time shoppers
The marketing team tailored campaigns for each segment, increasing retention by 18%.
2. Healthcare: Patient Risk Stratification
Hospitals use clustering to group patients based on:
Age
Medical history
Lab results
Hospital visits
Case Study:
A healthcare provider applied K-Means to identify high-risk chronic patients. Targeted preventive care programs reduced emergency admissions by 12%.
3. Banking & Fraud Detection
Banks cluster transactions to identify unusual patterns.
Although K-Means is not a complete fraud detection system, it helps:
Detect anomaly groups
Segment risk profiles
Analyze transaction behaviors
4. Image Compression
K-Means reduces the number of distinct colors in an image.
Each pixel is assigned to the nearest cluster center (color centroid), reducing storage while preserving visual quality.
This technique is widely used in graphics optimization.
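A minimal sketch of the idea, using random numbers as stand-in pixel values (a real application would read the pixels from an image file):

```r
set.seed(1)
pixels <- matrix(runif(300 * 3), ncol = 3,
                 dimnames = list(NULL, c("R", "G", "B")))  # 300 "pixels"

fit <- kmeans(pixels, centers = 4, nstart = 10)
compressed <- fit$centers[fit$cluster, ]  # each pixel becomes its centroid color
nrow(unique(compressed))  # at most 4 distinct colors remain
```

Storing 4 palette colors plus one small index per pixel takes far less space than storing 300 full RGB triples.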
5. Telecom: Network Optimization
Telecom companies cluster:
Call data records
Tower traffic
Geographic usage
This helps in:
Infrastructure planning
Capacity forecasting
Identifying congestion zones
6. AI Consulting & Business Intelligence
In enterprise AI consulting projects, K-Means is frequently used to:
Identify behavioral clusters in unlabeled data
Segment markets
Discover product usage patterns
Improve operational efficiency
It often acts as the first exploratory step before building predictive models.
Strengths and Limitations
Strengths
Simple and intuitive
Computationally efficient
Works well with large datasets
Easy to interpret
Limitations
Requires pre-defined K
Sensitive to initialization
Struggles with non-spherical clusters
Sensitive to outliers
Assumes similar cluster sizes
Best Practices for Using K-Means in R
Scale variables using scale()
Use multiple random starts (nstart parameter)
Visualize clusters
Validate results with silhouette scores
Combine with domain knowledge
Example:
kmeans(data, centers=3, nstart=25)
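Silhouette validation can be sketched with the cluster package (shipped with standard R installations); the iris petal measurements are used here only as example data:

```r
library(cluster)  # provides silhouette()

scaled <- scale(iris[, 3:4])
fit <- kmeans(scaled, centers = 3, nstart = 25)
sil <- silhouette(fit$cluster, dist(scaled))
mean(sil[, "sil_width"])  # closer to 1 means better-separated clusters
```

An average silhouette width near 1 indicates compact, well-separated clusters; values near 0 suggest overlapping clusters, and negative values suggest misassigned points.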
Conclusion
K-Means clustering is one of the most fundamental algorithms in machine learning. Originating in the 1950s, it remains relevant due to its simplicity and scalability.
However, successful implementation requires:
Understanding its assumptions
Proper data preprocessing
Careful selection of K
Awareness of limitations
When used correctly, K-Means can uncover powerful insights in customer segmentation, healthcare analytics, fraud detection, telecom optimization, and AI-driven business intelligence.
Unsupervised learning is not guesswork. It requires thoughtful application, careful validation, and a deep understanding of the algorithm’s mechanics.
K-Means may be simple—but mastering it is what transforms raw data into actionable intelligence.
This article was originally published on Perceptive Analytics.
At Perceptive Analytics, our mission is “to enable businesses to unlock value in data.” For over 20 years, we’ve partnered with more than 100 clients—from Fortune 500 companies to mid-sized firms—to solve complex data analytics challenges. Our services include AI Consulting in San Francisco, AI Consulting in San Jose, and AI Consulting in Seattle, turning data into strategic insight. We would love to talk to you, so do reach out to us.