Introduction
K-Means is an unsupervised machine learning algorithm. The algorithm divides the data points into k groups (called clusters), where each data point can belong to only one cluster. K-Means aims to group together similar data points into the same cluster, while keeping different clusters as far apart as possible.
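Each cluster has a center (also called a centroid), a point that represents the middle of the cluster. The centers start out as data points, but after the first update they are averages and usually do not coincide with any data point. A data point is assigned to the cluster whose center is closest to it. Closeness is measured with the squared Euclidean distance, and K-Means tries to minimize the sum of these squared distances between every data point and the center of its cluster.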
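To make the distance rule concrete, here is a minimal sketch in plain Python. The function names `squared_distance` and `nearest_center` and the sample coordinates are made up for illustration; they do not come from any library.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal length."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest_center(point, centers):
    """Index of the center with the smallest squared distance to `point`."""
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))


# Two hypothetical cluster centers in 2D.
centers = [(0.0, 0.0), (5.0, 5.0)]

print(nearest_center((1.0, 2.0), centers))  # 0  (squared distances: 5 vs 25)
print(nearest_center((4.0, 3.0), centers))  # 1  (squared distances: 25 vs 5)
```

Taking the square root would not change which center is closest, so the squared distance is enough for the assignment.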
Algorithm
- Select the number of clusters, k
- Pick k data points as the initial cluster centers (either at random, or spaced as far apart as possible)
- Repeat the following two steps until the cluster assignments stop changing:
  - For each data point, calculate the squared distance to every cluster center and assign the point to the cluster with the closest center.
  - Recalculate the center of each cluster by taking the average of all data points assigned to that cluster.
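The steps above can be sketched in plain Python as follows. This is only an illustrative sketch, not a reference implementation: the names `k_means`, `seed` and `max_iter`, the choice of random initial centers, the rule of keeping the old center when a cluster ends up empty, and the sample points are all assumptions made for this example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal length."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise average of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Pick k distinct data points at random as the initial centers.
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the closest center.
        new_labels = [
            min(range(k), key=lambda i: squared_distance(p, centers[i]))
            for p in points
        ]
        # Stop once no point changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each center to the average of its assigned points.
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:  # assumption: an empty cluster keeps its old center
                centers[i] = mean(members)
    return centers, labels


# Hypothetical sample data: two visibly separate groups of three points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centers, labels = k_means(points, k=2)
print(centers)
print(labels)
```

The `max_iter` cap is just a safety net for this sketch; on small data sets like this one the loop stops after a few passes because the assignments no longer change.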
Additional Information
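- K-Means clustering is highly sensitive to the initially chosen cluster centers: different starting centers can lead to different final clusters. Hence, K-Means is usually run several times with different starting centers, keeping the run with the lowest sum of squared distances.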
- If you do not know the optimum number of clusters, try the algorithm with different values of k and compare the sum of squared distances for each. This sum always shrinks as k grows, so look for the value of k beyond which adding more clusters brings only a small improvement.
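Both tips can be tried out with a short sketch like the one below. It is only an illustration under stated assumptions: the helper names `k_means`, `total_cost` and `best_of`, the number of restarts, and the sample points are invented for this example, and the compact `k_means` keeps an empty cluster's old center.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points of equal length."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, rng, max_iter=100):
    """Compact K-Means; initial centers are k random data points."""
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda i: sq_dist(p, centers[i])) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                centers[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, labels


def total_cost(points, centers, labels):
    """Sum of squared distances from each point to its assigned center."""
    return sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))


def best_of(points, k, restarts=10, seed=0):
    """Run K-Means several times and keep the run with the lowest cost."""
    rng = random.Random(seed)
    best = None
    for _ in range(restarts):
        centers, labels = k_means(points, k, rng)
        cost = total_cost(points, centers, labels)
        if best is None or cost < best[0]:
            best = (cost, centers, labels)
    return best


# Hypothetical sample data: two visibly separate groups of three points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]

# Best cost found for each candidate k.
costs = {k: best_of(points, k)[0] for k in range(1, 5)}
for k, cost in costs.items():
    print(k, round(cost, 3))
```

On data like this, the cost should drop sharply from k = 1 to k = 2 and only slightly after that, which suggests that two clusters describe the data well.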
For further information, please check out https://stanford.edu/~cpiech/cs221/handouts/kmeans.html (image taken from here)