What is Clustering: An Introduction

#programming #machinelearning #productivity #tutorial

This article was written by Malik Jahan, a member of Educative's technical content team.

Machine learning has evolved as a panoramic field and is applied across a wide variety of disciplines. The selection and application of most machine learning algorithms primarily depend on the nature of the task and the dataset. If the dataset contains a set of instances or data points that don't have a pre-determined label, then the clustering algorithms are expected to process the data and attempt to extract different patterns. For example, if a bowl contains a mix of balls of different sizes and colors (with no additional information) and the task is to come up with the appropriate number of groups of balls, then this is an example of the task of clustering.

Clustering is an unsupervised learning strategy to group the given set of data points into a number of groups or clusters.

Arranging the data into a reasonable number of clusters helps to extract underlying patterns in the data and transform the raw data into meaningful knowledge. Example application areas include the following:

Pattern recognition
Image segmentation
Profiling users or customers
Categorization of objects into a number of categories or groups
Detection of outliers or noise in a pool of data items

Given a dataset, distribute the data into an appropriate number of clusters. In the literature, there are many clustering algorithms. The next sections explore two popular clustering algorithms.

The k-means clustering algorithm

The k-means clustering algorithm is one of the most commonly used clustering algorithms. It clusters the given data into k clusters. The algorithm expects k to be defined as the input to the algorithm. It's an iterative algorithm and performs the following steps to cluster the given data into k clusters:

Choose k arbitrary centroids representing k clusters (One common way to choose the initial centroids is to designate the first k data points as k centroids.)
Compare each data point with all k centroids and assign them to the closest clusters. An appropriate distance function is used to compute the distance between two data items.
Recompute the centroids based on the new assignment. The mean of the data items of each cluster serves as the centroid.
Keep repeating steps 2 and 3 until there is no change in cluster assignment (or mean of clusters) or an upper limit on the number of iterations is reached.

In order to compute the assignment of a data point in the closest cluster, its distance from all centroids is computed, and the closest cluster is decided. One of the most common distance functions used is Euclidean distance:

where x_i and y_i are the ith parameters of the x and y data instances and n is the number of features in each instance.

These steps are executed on a small dataset step-by-step to form two clusters below:

Here is the k-means algorithm implemented in the same example:

from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt

X = np.array([[1,2],[2,2],[2,3],[8,7],[8,8],[25,80],[23,82],[100,100]])
clustering = KMeans(n_clusters=2).fit(X)
labels = clustering.labels_
colors = ("red","green","blue","pink","magenta","black","yellow")
for i in range(len(X)):
  plt.scatter(X[i][0],X[i][1], c = colors[labels[i]], marker = 'x')
plt.savefig("output/scatter.png")

Output:

Below is a line-by-line explanation of the code:

Line 1: The KMeans class is imported from sklearn.cluster package.
Line 2: The numpy library is imported to initialize a dataset to be used in the program.
Line 3: The matplotlib.pyplot library is imported to visualize the outcomes.
Line 5: X is initialized as an numpy array. It contains eight data items with two features each.
Line 6: The KMeans constructor is configured for $k=2$ and trained on X. The output is stored in the object clustering.
Line 7: Cluster assignment of each data point is extracted from clustering and stored in labels.
Line 8: A vector of colors is initialized and stored in colors.
Lines 9-11: Each data item is plotted in a scatter plot with a color corresponding to its cluster.

Density-based clustering algorithm

When it’s not possible to decide the number of clusters k beforehand, then the k-means clustering algorithm is not a good choice to cluster the data. Another bottleneck of the k-means algorithm is that it doesn't differentiate the noisy data points or outliers from the other data points.

Density-based clustering doesn't expect k as one of its input parameters. Instead, it clusters the given data based on the proximity (density) of the data points. One of the commonly used density-based clustering algorithms is DBSCAN (density-based spatial clustering of applications with noise). The algorithm expects the threshold eps to define the neighborhood of a data point and min_samples as the minimum acceptable size of a cluster. Data points that fall out of the eps neighborhood and don't make a cluster of the smallest possible size min_samples are treated as noisy data points or outliers.

Here is a walkthrough of the DBSCAN algorithm step-by-step:

Here is the DBSCAN algorithm implemented in the same example:

from sklearn.cluster import DBSCAN
import numpy as np
import matplotlib.pyplot as plt

X = np.array([[1,2],[2,2],[2,3],[8,7],[8,8],[25,80],[23,82],[100,100]])
clustering = DBSCAN(eps=3, min_samples=2).fit(X)
labels = clustering.labels_
colors = ("red","green","blue","pink")
for i in range(len(X)):
  plt.scatter(X[i][0],X[i][1], c = colors[labels[i]], marker = 'x')
plt.savefig("output/scatter.png")

Output:

Let's go through the code line by line:

Line 1: Import the DBSCAN class from sklearn.cluster package.
Line 2: The numpy library is imported to initialize a dataset to be used in the program.
Line 3: matplotlib.pyplot library is imported to visualize the outcomes.
Line 5: The X is initialized as an numpy array. It contains eight data items with two features each.
Line 6: The DBSCAN constructor is configured for $eps=3$ and min_samples=2 and trained on X. The output is stored in the object clustering.
Line 7: Cluster assignment of each data point is extracted from clustering and stored in labels.
Line 8: A vector of colors is initialized and stored in colors.
Lines 9-11: Each data item is plotted in a scatter plot with a color corresponding to its cluster.