This article was written by Malik Jahan, a member of Educative's technical content team.
Machine learning has grown into a broad field that is applied across a wide variety of disciplines. The choice of a machine learning algorithm depends primarily on the nature of the task and the dataset. If the dataset contains instances or data points without pre-determined labels, clustering algorithms can process the data and attempt to extract underlying patterns. For example, if a bowl contains a mix of balls of different sizes and colors (with no additional information) and the task is to come up with an appropriate number of groups of balls, then this is a clustering task.
Clustering is an unsupervised learning strategy that organizes a given set of data points into a number of groups, or clusters.
Arranging the data into a reasonable number of clusters helps to extract underlying patterns in the data and transform the raw data into meaningful knowledge. Example application areas include the following:
- Pattern recognition
- Image segmentation
- Profiling users or customers
- Categorization of objects into a number of categories or groups
- Detection of outliers or noise in a pool of data items
The clustering task can be stated simply: given a dataset, distribute the data into an appropriate number of clusters. Many clustering algorithms have been proposed in the literature. The next sections explore two popular ones.
The k-means clustering algorithm
The k-means clustering algorithm is one of the most commonly used clustering algorithms. It partitions the given data into k clusters, where k must be supplied as an input to the algorithm. It's an iterative algorithm that performs the following steps:
1. Choose k arbitrary centroids representing the k clusters. (One common way to choose the initial centroids is to designate the first k data points as the k centroids.)
2. Compare each data point with all k centroids and assign it to the closest cluster. An appropriate distance function is used to compute the distance between two data items.
3. Recompute the centroids based on the new assignment. The mean of the data items in each cluster serves as its centroid.
4. Repeat steps 2 and 3 until there is no change in the cluster assignments (or the cluster means), or until an upper limit on the number of iterations is reached.
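To make these steps concrete, here is a minimal from-scratch sketch of the algorithm in NumPy. It is illustrative rather than production-ready (the `kmeans` function name is our own, it uses the first-k initialization described above, and it assumes no cluster ever becomes empty):

```python
import numpy as np

def kmeans(X, k, max_iters=100):
    # Step 1: designate the first k data points as the initial centroids
    centroids = X[:k].astype(float)
    for _ in range(max_iters):
        # Step 2: assign each point to the cluster with the closest centroid
        # (Euclidean distance), computed via broadcasting
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids (and hence the assignments) stabilize
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

In practice, you would rely on a library implementation such as scikit-learn's `KMeans`, shown below.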
To assign a data point to the closest cluster, its distance from all k centroids is computed, and the cluster with the nearest centroid is chosen. One of the most commonly used distance functions is the Euclidean distance:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

where $x_i$ and $y_i$ are the $i$th features of the data instances $x$ and $y$, and $n$ is the number of features in each instance.
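As a quick sanity check, the Euclidean distance between two points is straightforward to compute with NumPy:

```python
import numpy as np

x = np.array([1, 2])
y = np.array([8, 7])
# sqrt((1-8)^2 + (2-7)^2) = sqrt(74)
print(np.sqrt(np.sum((x - y) ** 2)))  # 8.602...
print(np.linalg.norm(x - y))          # equivalent, using NumPy's built-in norm
```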
Consider a small two-dimensional dataset of eight points to be grouped into two clusters. Here is the k-means algorithm applied to this example using scikit-learn:
```python
from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80], [23, 82], [100, 100]])
clustering = KMeans(n_clusters=2).fit(X)
labels = clustering.labels_
colors = ("red", "green", "blue", "pink", "magenta", "black", "yellow")
for i in range(len(X)):
    plt.scatter(X[i][0], X[i][1], c=colors[labels[i]], marker='x')
plt.savefig("output/scatter.png")
```
Output: a scatter plot of the eight data points, with each point colored according to its assigned cluster.
Below is a line-by-line explanation of the code:
- Line 1: The `KMeans` class is imported from the `sklearn.cluster` package.
- Line 2: The `numpy` library is imported to initialize a dataset to be used in the program.
- Line 3: The `matplotlib.pyplot` library is imported to visualize the outcomes.
- Line 5: `X` is initialized as a `numpy` array. It contains eight data items with two features each.
- Line 6: The `KMeans` constructor is configured for $k=2$ and trained on `X`. The output is stored in the object `clustering`.
- Line 7: The cluster assignment of each data point is extracted from `clustering` and stored in `labels`.
- Line 8: A vector of colors is initialized and stored in `colors`.
- Lines 9-11: Each data item is plotted in a scatter plot with a color corresponding to its cluster.
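The fitted `KMeans` object exposes more than `labels_`. Continuing from the `clustering` object above, you can inspect the learned centroids and assign new, unseen points to their closest cluster:

```python
print(clustering.cluster_centers_)   # coordinates of the k learned centroids
print(clustering.predict([[3, 3]]))  # cluster assignment for a new point
```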
Density-based clustering algorithm
When it's not possible to decide the number of clusters k beforehand, the k-means clustering algorithm is not a good choice. Another limitation of k-means is that it doesn't differentiate noisy data points or outliers from the other data points: every point is assigned to some cluster.
Density-based clustering doesn't expect k as one of its input parameters. Instead, it clusters the given data based on the proximity (density) of the data points. One of the most commonly used density-based clustering algorithms is DBSCAN (density-based spatial clustering of applications with noise). The algorithm expects a threshold `eps`, which defines the neighborhood of a data point, and `min_samples`, the minimum acceptable size of a cluster. Data points that fall outside every `eps` neighborhood and don't form a cluster of at least `min_samples` points are treated as noisy data points or outliers.
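The following minimal sketch contrasts the two algorithms on a toy dataset with one obvious outlier. scikit-learn's `DBSCAN` marks noise points with the special label `-1`, whereas `KMeans` assigns every point, including the outlier, to some cluster:

```python
from sklearn.cluster import DBSCAN, KMeans
import numpy as np

X = np.array([[1, 2], [2, 2], [2, 3], [100, 100]])

print(DBSCAN(eps=3, min_samples=2).fit(X).labels_)  # [ 0  0  0 -1]: the outlier is labeled as noise
print(KMeans(n_clusters=2).fit(X).labels_)          # every point receives a cluster label
```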
Here is the DBSCAN algorithm applied to the same eight-point dataset used in the k-means example:
```python
from sklearn.cluster import DBSCAN
import numpy as np
import matplotlib.pyplot as plt

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80], [23, 82], [100, 100]])
clustering = DBSCAN(eps=3, min_samples=2).fit(X)
labels = clustering.labels_
colors = ("red", "green", "blue", "pink")
for i in range(len(X)):
    plt.scatter(X[i][0], X[i][1], c=colors[labels[i]], marker='x')
plt.savefig("output/scatter.png")
```
Output: a scatter plot of the eight data points, with each point colored according to its assigned cluster.
Let's go through the code line by line:
- Line 1: The `DBSCAN` class is imported from the `sklearn.cluster` package.
- Line 2: The `numpy` library is imported to initialize a dataset to be used in the program.
- Line 3: The `matplotlib.pyplot` library is imported to visualize the outcomes.
- Line 5: `X` is initialized as a `numpy` array. It contains eight data items with two features each.
- Line 6: The `DBSCAN` constructor is configured for `eps=3` and `min_samples=2` and trained on `X`. The output is stored in the object `clustering`.
- Line 7: The cluster assignment of each data point is extracted from `clustering` and stored in `labels`. DBSCAN labels noise points with `-1`; in this example, the isolated point `[100, 100]` is marked as noise.
- Line 8: A vector of colors is initialized and stored in `colors`.
- Lines 9-11: Each data item is plotted in a scatter plot with a color corresponding to its cluster. The noise label `-1` indexes the last color in the tuple (`"pink"`), since Python supports negative indexing.
Feel free to play with the code of both algorithms (particularly the parameters each algorithm expects) and observe their impact on the output.
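As a starting point, here is a small sketch (reusing the same dataset `X` as above) that sweeps `eps` and reports how the number of clusters and noise points changes:

```python
from sklearn.cluster import DBSCAN
import numpy as np

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80], [23, 82], [100, 100]])

for eps in (1, 3, 10, 100):
    labels = DBSCAN(eps=eps, min_samples=2).fit(X).labels_
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # exclude the noise label -1
    n_noise = np.sum(labels == -1)
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points, labels={labels}")
```

A very small `eps` fragments the data and leaves many points as noise, while a very large `eps` merges everything into a single cluster.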
By now, you probably have a good grasp of the basics of clustering and are ready to start your journey to become a machine learning expert.
For a much deeper dive into clustering and machine learning, explore the following courses:
As always, happy learning!