K-Means

K-Means is one of the most popular clustering algorithms. An unsupervised learning algorithm, K-Means groups similar data points in a data set together into groups, or "clusters," and in doing so discovers hidden structure in the data.

Here is a detailed explanation of the K-Means algorithm:

Initially, K cluster centers (or "centroids") are randomly selected. K is a user-specified parameter that sets the number of clusters to be found in the data set.

Each data point is assigned to the cluster center closest to it. This is usually done using a distance measure such as the Euclidean distance.

The center of each cluster is updated by averaging all data points in that cluster.

These two steps (assigning data points and updating centers) are repeated until the centers remain unchanged or until a specified number of iterations is reached.

At the end of the algorithm, each data point is assigned to a cluster.
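
Translated into code, these steps form a short loop. Below is a minimal NumPy sketch of the algorithm; it skips practical details such as handling empty clusters and smarter initialization:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. randomly pick k data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. assign each point to the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. move each centroid to the mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4. stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```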

Advantages of the K-Means algorithm include its simplicity and speed. However, it also has some disadvantages. For example, K must be chosen up front, which is not always easy. Additionally, the algorithm is sensitive to the random starting points, which can lead to different results across runs.

This is a general summary of the K-Means algorithm. For more detail, you can look at its mathematical formulation and optimization techniques.

Random Initialization Trap

The random initialization trap in the K-Means algorithm is the algorithm's sensitivity to the randomly selected initial cluster centers (centroids). That is, the location of the initially selected centers can significantly affect the algorithm's results.

For example, if you choose different starting centers in two different runs, the algorithm may produce different clusters in each case. This means that K-Means is a non-deterministic algorithm: it can be run multiple times on the same data set and give different results each time.

One way to avoid this trap is to use an improved initialization strategy such as K-Means++. K-Means++ chooses the starting centers more intelligently, spreading them out across the data, so the random initialization trap can largely be avoided.

Another method is to run the K-Means algorithm multiple times (e.g. 100 or 1000 times) and keep the solution with the lowest total within-cluster sum of squares (or another error measure). This reduces the effect of the random initialization trap, but at a higher computational cost.
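
If you use scikit-learn, both remedies are built in: `init="k-means++"` selects well-spread starting centroids, and `n_init` controls how many random restarts are run before keeping the best one. A minimal sketch on synthetic data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# init="k-means++" picks well-spread initial centroids;
# n_init=10 runs K-Means 10 times from different starts and
# keeps the run with the lowest inertia (total WCSS)
km = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=42)
labels = km.fit_predict(X)
print(km.inertia_)  # WCSS of the best of the 10 runs
```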

How can we decide the number of clusters in K-Means?

Determining the number of clusters in the K-Means algorithm often poses a challenge: it is data dependent and there is usually no obvious answer. However, several methods can be used to choose the number of clusters:

Elbow Method: In this method, the total squared error (the within-cluster sum of squares, WCSS) is calculated for different numbers of clusters. As the number of clusters increases, the total squared error generally decreases. At some point, however, the rate of decrease drops sharply; this point is called the "elbow" and is generally taken as a good number of clusters.

Silhouette Coefficient: This metric measures how well the clusters are separated. For each data point, a score is calculated from its average distance to the points in its own cluster and its distance to the nearest other cluster. The silhouette coefficients of all data points are averaged, and the number of clusters that maximizes this average is chosen.

Gap Statistic: This method compares the within-cluster variance of the real data with the within-cluster variance expected for random (reference) data. The number of clusters with the largest gap between these two values is generally considered a good choice.

None of these methods is perfect, and each may work better for certain types of data or cluster structures. In practice you can also use trial and error and see which number of clusters works best for your problem, for example with a loop like the one sketched below.
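
As a concrete example of the silhouette approach, scikit-learn's `silhouette_score` turns the comparison into a short loop. This is a sketch on synthetic data; note that the silhouette requires at least 2 clusters:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# try several candidate values of k and report the average silhouette;
# the k with the highest score is the preferred choice
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(f"k={k}: silhouette={silhouette_score(X, labels):.3f}")
```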

WCSS

WCSS is an abbreviation for "Within-Cluster Sum of Squares". It is a criterion used in clustering algorithms such as K-Means.

WCSS is the sum of the squared (Euclidean) distances of each data point in a cluster from the center of that cluster. It measures how tightly the data points within a cluster are packed around the center.
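
In symbols, for clusters C_1, ..., C_K with centroids mu_1, ..., mu_K:

```latex
\mathrm{WCSS} = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
```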

Generally, the aim of clustering algorithms is to minimize the WCSS value, i.e. to make the data points within each cluster as close to their center as possible. This keeps data points as similar as possible within their own cluster (and as different as possible from the other clusters).

WCSS is used to determine the number of clusters, especially in methods such as the Elbow Method. WCSS generally decreases as the number of clusters increases, but at some point the rate of decrease drops sharply; this point is called the "elbow" and is generally a good choice for the number of clusters.
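
Here is a minimal sketch of how such an elbow plot is produced, using scikit-learn's `inertia_` attribute (its name for WCSS) on synthetic data with 3 true clusters:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# fit K-Means for k = 1..10 and record the WCSS of each fit
wcss = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
        for k in range(1, 11)]

plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("WCSS")
plt.title("Elbow method")
plt.show()
```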

As the resulting plot shows, the WCSS value decreases as the number of clusters increases.

The elbow point is also visible: there is a sudden break at k = 3, and that is the elbow.
