Clustering Algorithms: K-Means

#machinelearning #datascience

Introduction

K-Means is an unsupervised machine learning algorithm. The algorithm divides the data points into k groups (called clusters), where each data point can belong to only one cluster. K-Means aims to group together similar data points into the same cluster, while keeping different clusters as far apart as possible.

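Each cluster has a center (also called a centroid) that represents the cluster. A center starts out as one of the data points and is later updated to the average of the points assigned to its cluster, so it need not coincide with any data point. A data point is assigned to the cluster whose center is closest to it. Closeness is measured with the squared Euclidean distance, that is, the sum of the squared differences between the coordinates of the point and the center.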
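
As a minimal sketch of the distance measure and the "closest center" rule described above, the snippet below uses plain Python tuples as points. The function names squared_distance and closest_center, and the sample centers, are illustrative assumptions and not part of any library.

```python
def squared_distance(p, q):
    """Sum of squared coordinate differences between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def closest_center(point, centers):
    """Index of the center with the smallest squared distance to the point."""
    distances = [squared_distance(point, c) for c in centers]
    return distances.index(min(distances))


# Hypothetical example data: two centers in 2D.
centers = [(0.0, 0.0), (5.0, 5.0)]
print(squared_distance((1.0, 2.0), centers[0]))  # 1^2 + 2^2 = 5.0
print(closest_center((1.0, 2.0), centers))       # 0, since 5.0 < 25.0
```

Working with the squared distance avoids a square root and does not change which center is closest.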

Algorithm

Select the number of clusters, k
Choose k data points as the initial cluster centers (either at random, or spaced as far apart as possible)
Until the cluster assignments stop changing, repeat the following:
1. For each data point, calculate the squared distance between it and every cluster center.
2. Assign each data point to the cluster with the closest center.
3. Recalculate the center of each cluster by taking the average of all data points assigned to that cluster.
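
The steps above can be sketched in plain Python as follows. This is an illustrative sketch under a few assumptions, not a reference implementation: initial centers are picked at random from the data, an empty cluster keeps its previous center, and a max_iter cap (not part of the description above) guards against very long runs. All function and parameter names are made up for this example.

```python
import random


def squared_distance(p, q):
    """Sum of squared coordinate differences between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def closest_center(point, centers):
    """Index of the center nearest to the point."""
    distances = [squared_distance(point, c) for c in centers]
    return distances.index(min(distances))


def kmeans(points, k, seed=0, max_iter=100):
    """Return (centers, labels) for a list of equal-length tuples."""
    rng = random.Random(seed)
    # Choose k data points as the initial centers (random variant).
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Steps 1 and 2: assign every point to its closest center.
        new_labels = [closest_center(p, centers) for p in points]
        if new_labels == labels:
            break  # assignments did not change, so we are done
        labels = new_labels
        # Step 3: move each center to the average of its members.
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # assumption: an empty cluster keeps its old center
                centers[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, labels


# Hypothetical toy data: two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centers, labels = kmeans(points, k=2)
print(centers)
print(labels)
```

On this toy data the first three points should end up sharing one label and the last three the other; which group gets label 0 depends on the random initial centers.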

Additional Information

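K-Means clustering is highly sensitive to the initially chosen cluster centers: different starting centers can lead to different final clusters. Hence, K-Means is usually run several times with different starting centers, keeping the run with the smallest total squared distance between the data points and their cluster centers.
If you do not know the optimum number of clusters for the data, try the algorithm with different values of k and compare the results, for example by looking at how the total squared distance between points and their centers changes with k, and select the k for which the data is grouped well.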
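
Both suggestions can be tried together in a short, self-contained sketch: for each candidate k, run K-Means from several random starting centers and keep the run with the lowest total squared distance. The helper names (kmeans, best_of_restarts), the restarts and max_iter parameters, and the toy data are assumptions made for illustration.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def closest_center(point, centers):
    distances = [squared_distance(point, c) for c in centers]
    return distances.index(min(distances))


def kmeans(points, k, rng, max_iter=100):
    """One K-Means run; returns (centers, labels, cost)."""
    centers = rng.sample(points, k)  # random data points as starting centers
    labels = None
    for _ in range(max_iter):
        new_labels = [closest_center(p, centers) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # assumption: an empty cluster keeps its old center
                centers[i] = tuple(sum(c) / len(members) for c in zip(*members))
    # Cost: total squared distance between points and their assigned centers.
    cost = sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))
    return centers, labels, cost


def best_of_restarts(points, k, restarts=10, seed=0):
    """Run K-Means several times and keep the lowest-cost run."""
    rng = random.Random(seed)
    runs = [kmeans(points, k, rng) for _ in range(restarts)]
    return min(runs, key=lambda run: run[2])


# Hypothetical toy data: two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
for k in range(1, 5):
    _, _, cost = best_of_restarts(points, k)
    print(k, round(cost, 2))
```

The cost generally keeps shrinking as k grows (with k equal to the number of distinct points it can reach zero), so the smallest cost alone does not pick k. A common heuristic is to look for the value of k after which the improvement levels off.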

For further information, please check out https://stanford.edu/~cpiech/cs221/handouts/kmeans.html (image taken from here)